[PoC] Improve dead tuple storage for lazy vacuum
Hi all,
Index vacuuming is one of the most time-consuming processes in lazy
vacuuming, and lazy_tid_reaped() accounts for a large part of it. The
attached flame graph shows a profile of a vacuum on a table that has
one index, 80 million live rows, and 20 million dead rows, where
lazy_tid_reaped() accounts for about 47% of the total vacuum execution
time.
lazy_tid_reaped() is essentially an existence check: for every index
tuple, it checks whether the heap TID it points to exists in the set
of dead tuple TIDs. The memory for dead tuple TIDs is limited by
maintenance_work_mem, and if the upper limit is reached, the heap scan
is suspended, index vacuum and heap vacuum are performed, and then the
heap scan is resumed. Therefore, in terms of index vacuuming
performance, there are two important factors: the speed of looking up
TIDs in the set of dead tuples, and the set's memory usage. The former
is obvious, whereas the latter determines the number of index vacuum
passes. In many index AMs, index vacuuming (i.e., ambulkdelete)
performs a full scan of the index, so it is important for performance
to avoid executing index vacuuming more than once during a lazy
vacuum.
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
1. It cannot allocate more than 1GB. There was a discussion about
eliminating this limitation by using MemoryContextAllocHuge(), but
there were concerns about point 2 [1].
2. It allocates the whole memory space at once.
3. Lookup performance is slow (O(logN)).
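For reference, the current logic is roughly the following (a
simplified sketch of the existing vacuumlazy.c code; the real
lazy_tid_reaped() also has a fast-path range precheck, omitted here):

#include "postgres.h"
#include "storage/itemptr.h"

/* Simplified sketch of the current array-based dead tuple lookup */
static int
vac_cmp_itemptr(const void *left, const void *right)
{
    BlockNumber lblk = ItemPointerGetBlockNumber((ItemPointer) left);
    BlockNumber rblk = ItemPointerGetBlockNumber((ItemPointer) right);
    OffsetNumber loff,
                roff;

    if (lblk != rblk)
        return (lblk < rblk) ? -1 : 1;

    loff = ItemPointerGetOffsetNumber((ItemPointer) left);
    roff = ItemPointerGetOffsetNumber((ItemPointer) right);
    if (loff != roff)
        return (loff < roff) ? -1 : 1;
    return 0;
}

/* Called through a function pointer once per index tuple */
static bool
lazy_tid_reaped(ItemPointer itemptr, void *state)
{
    LVDeadTuples *dead_tuples = (LVDeadTuples *) state;

    return bsearch(itemptr, dead_tuples->itemptrs, dead_tuples->num_tuples,
                   sizeof(ItemPointerData), vac_cmp_itemptr) != NULL;
}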
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Problems and Solutions
===============
Firstly, I considered using existing data structures:
IntegerSet (src/backend/lib/integerset.c) and
TIDBitmap (src/backend/nodes/tidbitmap.c). Both address point 1, but
each addresses only one of points 2 and 3. IntegerSet uses less memory
thanks to Simple-8b encoding, but lookup is slow, still O(logN), since
it's a tree structure. On the other hand, TIDBitmap has good lookup
performance, O(1), but can use unnecessarily large amounts of memory
in some cases, since it always allocates enough bitmap space to store
all possible offsets. With 8kB blocks, the maximum number of line
pointers in a heap page is 291 (cf. MaxHeapTuplesPerPage), so the
bitmap is 40 bytes long and we always need 46 bytes in total per
block, including other meta information.
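To spell out that arithmetic (my own illustration, assuming 64-bit
bitmapwords as in tidbitmap.c):

/* Per-block cost of a TIDBitmap-style entry with 8kB blocks */
#define MAX_TUPLES  291                     /* MaxHeapTuplesPerPage */
#define WORDS       ((MAX_TUPLES + 63) / 64)    /* = 5 bitmapwords */
/* bitmap: 5 * 8 = 40 bytes; together with the block number and entry
 * status, that is the ~46 bytes per block mentioned above */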
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum, borrowing ideas from Roaring Bitmap [2]. The
authors provide an implementation of Roaring Bitmap [3] (Apache 2.0
license), but I implemented the idea from scratch because we need to
integrate it with Dynamic Shared Memory/Area to support parallel
vacuum, and we need to support ItemPointerData, a 6-byte integer in
total, whereas that implementation supports only 4-byte integers.
Also, when it comes to vacuum, we need neither intersection, union,
nor difference between sets; we need only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of a
hash table and a container area: the hash table has one entry per
block, and each block entry allocates memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. Within the container area, the representation of the offset
numbers varies depending on their cardinality. There are three
container types: array, bitmap, and run.
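To make the layout concrete, here is a rough sketch of what the data
structure looks like (illustrative only; the names and fields here are
simplified, not the actual PoC code):

typedef enum RTbmContainerType
{
    RTBM_CONTAINER_ARRAY,       /* sorted 2-byte offsets */
    RTBM_CONTAINER_BITMAP,      /* uncompressed offset bitmap */
    RTBM_CONTAINER_RUN          /* (start, length) 2-byte pairs */
} RTbmContainerType;

/* one entry per heap block that has at least one dead tuple */
typedef struct RTbmBlockEntry
{
    BlockNumber blkno;          /* hash key */
    RTbmContainerType type;
    uint32      offset;         /* byte offset into the container area */
    uint16      len;            /* number of 2-byte values, or bytes
                                 * for a bitmap container */
} RTbmBlockEntry;

typedef struct RTbm
{
    struct rtbm_hash *hashtab;  /* blkno -> RTbmBlockEntry */
    char       *containers;     /* container area, enlarged as needed */
    Size        used;
    Size        allocated;
} RTbm;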
For example, if there are two dead tuples at offsets 1 and 150, it
uses an array container holding two 2-byte integers representing 1 and
150, 4 bytes in total. If we used a bitmap container in this case, we
would need 20 bytes instead. On the other hand, if there are 20
consecutive dead tuples from offset 1 to 20, it uses a run container
holding an array of 2-byte integer pairs: the first value in each pair
is a starting offset number, and the second is the run length.
Therefore, in this case the run container also uses only 4 bytes in
total. Finally, if there are dead tuples at every other offset from 1
to 100, it uses a bitmap container holding an uncompressed bitmap,
using 13 bytes. On top of this, each block needs another 16 bytes for
its hash table entry.
The lookup complexity of a bitmap container is O(1), whereas that of
an array or a run container is O(logN) or O(N). But since the number
of elements in those two containers should not be large, this should
not be a problem.
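A lookup then dispatches on the container type, along these lines
(again an illustrative sketch; rtbm_hash_lookup() is a hypothetical
hash table helper):

static bool
rtbm_lookup(RTbm *rtbm, ItemPointer tid)
{
    BlockNumber blk = ItemPointerGetBlockNumber(tid);
    OffsetNumber off = ItemPointerGetOffsetNumber(tid);
    RTbmBlockEntry *entry = rtbm_hash_lookup(rtbm->hashtab, blk);
    char       *cont;

    if (entry == NULL)
        return false;           /* no dead tuples on this block */
    cont = rtbm->containers + entry->offset;

    switch (entry->type)
    {
        case RTBM_CONTAINER_BITMAP:     /* O(1) bit test */
            return (((uint8 *) cont)[(off - 1) / 8] &
                    (1 << ((off - 1) % 8))) != 0;

        case RTBM_CONTAINER_ARRAY:      /* O(logN) binary search */
            {
                uint16     *offs = (uint16 *) cont;
                int         lo = 0,
                            hi = entry->len - 1;

                while (lo <= hi)
                {
                    int         mid = (lo + hi) / 2;

                    if (offs[mid] == off)
                        return true;
                    else if (offs[mid] < off)
                        lo = mid + 1;
                    else
                        hi = mid - 1;
                }
                return false;
            }

        case RTBM_CONTAINER_RUN:    /* O(N) over (start, length) pairs */
            {
                uint16     *runs = (uint16 *) cont;

                for (int i = 0; i < entry->len; i += 2)
                {
                    if (off >= runs[i] && off < runs[i] + runs[i + 1])
                        return true;
                }
                return false;
            }
    }
    return false;
}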
Evaluation
========
Before implementing this idea and integrating it with the lazy vacuum
code, I implemented a benchmark tool dedicated to evaluating
lazy_tid_reaped() performance [4]. It has functions for generating
TIDs for both index tuples and dead tuples, loading the dead tuples
into the data structure, and simulating lazy_tid_reaped() using those
virtual index tuples and dead tuples. The code lacks many features
such as iteration and DSM/DSA support, but it makes testing the data
structures easier.
FYI, I've confirmed the validity of this tool. When I ran a vacuum on
a 3GB table, index vacuuming took 12.3 sec and lazy_tid_reaped() took
approximately 8.5 sec. Simulating a similar situation with the tool,
the lookup benchmark with the array data structure took approximately
8.0 sec. Given that the tool doesn't simulate the cost of function
calls, it seems to simulate reality reasonably well.
I've evaluated the lookup performance and memory footprint of four
data structures: array, integerset (intset), tidbitmap (tbm), and
roaring tidbitmap (rtbm), while changing the distribution of dead
tuples across blocks. Since tbm doesn't have a function for an
existence check, I added one, and I allocated enough memory to ensure
that tbm never becomes lossy during the evaluation. In all test cases,
I simulated a table with 1,000,000 blocks where every block has at
least one dead tuple. The benchmark scenario is that for each virtual
heap tuple we check whether its TID is in the dead tuple storage. Here
are the results, with execution time in milliseconds and memory usage
in bytes:
* Test-case 1 (10 dead tuples at intervals of 20 offsets)
An array container is selected in this test case, using 20 bytes for each block.
        Execution Time    Memory Usage
array        14,140.91      60,008,248
intset        9,350.08      50,339,840
tbm           1,299.62     100,671,544
rtbm          1,892.52      58,744,944
* Test-case 2 (10 consecutive dead tuples from offset 1)
A bitmap container is selected in this test case, using 2 bytes for each block.
        Execution Time    Memory Usage
array         1,056.60      60,008,248
intset          650.85      50,339,840
tbm             194.61     100,671,544
rtbm            154.57      27,287,664
* Test-case 3 (2 dead tuples at offsets 1 and 100)
An array container is selected in this test case, using 4 bytes for
each block. Since the 'array' data structure (not the array container
of rtbm) uses only 12 bytes per block, and rtbm additionally needs a
hash table entry per block, the 'array' data structure uses less
memory here.
        Execution Time    Memory Usage
array         6,054.22      12,008,248
intset        4,203.41      16,785,408
tbm             759.17     100,671,544
rtbm            750.08      29,384,816
* Test-case 4 (100 consecutive dead tuples from offset 1)
A run container is selected in this test case, using 4 bytes for each block.
        Execution Time    Memory Usage
array         8,883.03     600,008,248
intset        7,358.23     100,671,488
tbm             758.81     100,671,544
rtbm            764.33      29,384,816
Overall, 'rtbm' has much better lookup performance and good memory
usage, especially when there are relatively many dead tuples. However,
in some cases, 'intset' and 'array' have better memory usage.
Feedback is very welcome. Thank you for reading the email through to the end.
Regards,
[1]: /messages/by-id/CAGTBQpbDCaR6vv9=scXzuT8fSbckf=a3NgZdWFWZbdVugVht6Q@mail.gmail.com
[2]: http://roaringbitmap.org/
[3]: https://github.com/RoaringBitmap/CRoaring
[4]: https://github.com/MasahikoSawada/pgtools/tree/master/bdbench
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Wed, 7 Jul 2021 at 13:47, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [...] Overall, 'rtbm' has much better lookup performance and good
> memory usage, especially when there are relatively many dead tuples.
> However, in some cases, 'intset' and 'array' have better memory
> usage.

Those are some great results, with a good path to meaningful improvements.

> Feedback is very welcome.
The currently available infrastructure for TIDs is quite ill-defined
for TableAM authors [0], and other TableAMs might want to use more
than just the 11 bits needed by heapam's max-BLCKSZ
MaxHeapTuplesPerPage to identify tuples (MaxHeapTuplesPerPage is 1169
at the maximum 32kB BLCKSZ, which requires 11 bits).
Could you also check what the (performance, memory) impact would be if
these proposed structures were to support the maximum
MaxHeapTuplesPerPage of 1169 or the full uint16-range of offset
numbers that could be supported by our current TID struct?
Kind regards,
Matthias van de Meent
[0]: /messages/by-id/0bbeb784050503036344e1f08513f13b2083244b.camel@j-davis.com
On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Currently, the TIDs of dead tuples are stored in an array that is
> collectively allocated at the start of lazy vacuum and TID lookup uses
> bsearch(). There are the following challenges and limitations:
>
> 1. It cannot allocate more than 1GB. There was a discussion about
> eliminating this limitation by using MemoryContextAllocHuge(), but
> there were concerns about point 2 [1].
I think that the main problem with the 1GB limitation is that it is
surprising -- it can cause disruption when we first exceed the magical
limit of ~174 million TIDs. This can cause us to dirty index pages a
second time when we might have been able to just do it once with
sufficient memory for TIDs. OTOH there are actually cases where having
less memory for TIDs makes performance *better* because of locality
effects. This perverse behavior with memory sizing isn't a rare case
that we can safely ignore -- unfortunately it's fairly common.
My point is that we should be careful to choose the correct goal.
Obviously memory use matters. But it might be more helpful to think of
memory use as just a proxy for what truly matters, not a goal in
itself. It's hard to know what this means (what is the "real goal"?),
and hard to measure it even if you know for sure. It could still be
useful to think of it like this.
> A run container is selected in this test case, using 4 bytes for each block.
>
>         Execution Time    Memory Usage
> array         8,883.03     600,008,248
> intset        7,358.23     100,671,488
> tbm             758.81     100,671,544
> rtbm            764.33      29,384,816
>
> Overall, 'rtbm' has much better lookup performance and good memory
> usage, especially when there are relatively many dead tuples. However,
> in some cases, 'intset' and 'array' have better memory usage.
This seems very promising.
I wonder how much you have thought about the index AM side. It makes
sense to initially evaluate these techniques using this approach of
separating the data structure from how it is used by VACUUM -- I think
that that was a good idea. But at the same time there may be certain
important theoretical questions that cannot be answered this way --
questions about how everything "fits together" in a real VACUUM might
matter a lot. You've probably thought about this at least a little
already. Curious to hear how you think it "fits together" with the
work that you've done already.
The loop inside btvacuumpage() makes each loop iteration call the
callback -- this is always a call to lazy_tid_reaped() in practice.
And that's where we do binary searches. These binary searches are
usually where we see a huge number of cycles spent when we look at
profiles, including the profile that produced your flame graph. But I
worry that that might be a bit misleading -- the way that profilers
attribute costs is very complicated and can never be fully trusted.
While it is true that lazy_tid_reaped() often accesses main memory,
which will of course add a huge amount of latency and make it a huge
bottleneck, the "big picture" is still relevant.
I think that the compiler currently has to make very conservative
assumptions when generating the machine code used by the loop inside
btvacuumpage(), which calls through an opaque function pointer at
least once per loop iteration -- anything can alias, so the compiler
must be conservative. The data dependencies are hard for both the
compiler and the CPU to analyze. The cost of using a function pointer
compared to a direct function call is usually quite low, but there are
important exceptions -- cases where it prevents other useful
optimizations. Maybe this is an exception.
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
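As a rough sketch of the shape I have in mind (not working code -- the
batch callback vacuum_tids_reaped_batch() is invented here, and
posting list tuples are ignored):

/* Inside btvacuumpage(), roughly: */
ItemPointerData htids[MaxIndexTuplesPerPage];
bool        dead[MaxIndexTuplesPerPage];
OffsetNumber deletable[MaxIndexTuplesPerPage];
int         nhtids = 0,
            ndeletable = 0;

/* Loop 1: collect the heap TID of every tuple on the leaf page */
for (offnum = minoff; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
{
    IndexTuple  itup = (IndexTuple)
        PageGetItem(page, PageGetItemId(page, offnum));

    htids[nhtids++] = itup->t_tid;
}

/* One batch call into vacuumlazy.c for the whole page, instead of one
 * callback call per index tuple; sets dead[i] for each reaped TID */
vacuum_tids_reaped_batch(htids, nhtids, dead, vstate->callback_state);

/* Loop 2: physically delete the tuples whose heap TIDs were dead */
for (int i = 0; i < nhtids; i++)
{
    if (dead[i])
        deletable[ndeletable++] = minoff + i;
}
if (ndeletable > 0)
    _bt_delitems_vacuum(rel, buf, deletable, ndeletable, NULL, 0);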
This approach would make btbulkdelete() similar to
_bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
an idea independent of yours -- I imagine that this would work
far better when combined with a more compact data structure, which is
naturally more capable of batch processing than a simple array of
TIDs. Maybe this will help the compiler and the CPU to fully
understand the *natural* data dependencies, so that they can be as
effective as possible in making the code run fast. It's possible that
a modern CPU will be able to *hide* the latency more intelligently
than what we have today. The latency is such a big problem that we may
be able to justify "wasting" other CPU resources, just because it
sometimes helps with hiding the latency. For example, it might
actually be okay to sort all of the TIDs on the page to make the bulk
processing work -- though you might still do a precheck that is
similar to the precheck inside lazy_tid_reaped() that was added by you
in commit bbaf315309e.
Of course it's very easy to be wrong about stuff like this. But it
might not be that hard to prototype. You can literally copy and paste
code from _bt_delitems_delete_check() to do this. It does the same
basic thing already.
--
Peter Geoghegan
On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.
Maybe for something like rtbm.c (which is inspired by Roaring
bitmaps), you would really want to use an "intersection" operation for
this. The TIDs that we need to physically delete from the leaf page
inside btvacuumpage() are the intersection of two bitmaps: our bitmap
of all TIDs on the leaf page, and our bitmap of all TIDs that need to
be deleted by the ongoing btbulkdelete() call.
Obviously the typical case is that most TIDs in the index do *not* get
deleted -- needing to delete more than ~20% of all TIDs in the index
will be rare. Ideally it would be very cheap to figure out that a TID
does not need to be deleted at all. Something a little like a negative
cache (but not a true negative cache). This is a little bit like how
hash joins can be made faster by adding a Bloom filter -- most hash
probes don't need to join a tuple in the real world, and we can make
these hash probes even faster by using a Bloom filter as a negative
cache.
If you had the list of TIDs from a leaf page sorted for batch
processing, and if you had roaring bitmap style "chunks" with
"container" metadata stored in the data structure, you could then use
merging/intersection -- that has some of the same advantages. I think
that this would be a lot more efficient than having one binary search
per TID. Most TIDs from the leaf page can be skipped over very
quickly, in large groups. It's very rare for VACUUM to need to delete
TIDs from completely random heap table blocks in the real world (some
kind of pattern is much more common).
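To sketch the kind of merging I mean (illustrative only, reusing the
rtbm container layout sketched upthread; rtbm_hash_lookup() is again a
hypothetical helper):

/* Intersect a sorted array of leaf-page TIDs with the per-block
 * containers; each block group is either skipped wholesale or checked
 * against its container. */
static void
rtbm_intersect_sorted(RTbm *rtbm, ItemPointerData *tids, int ntids,
                      bool *dead)
{
    int         i = 0;

    while (i < ntids)
    {
        BlockNumber blk = ItemPointerGetBlockNumber(&tids[i]);

        if (rtbm_hash_lookup(rtbm->hashtab, blk) == NULL)
        {
            /* whole block has no dead tuples: skip the group quickly */
            while (i < ntids && ItemPointerGetBlockNumber(&tids[i]) == blk)
                dead[i++] = false;
        }
        else
        {
            /* check each offset against this block's container */
            while (i < ntids && ItemPointerGetBlockNumber(&tids[i]) == blk)
            {
                dead[i] = rtbm_lookup(rtbm, &tids[i]);
                i++;
            }
        }
    }
}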
When this merging process finds 1 TID that might really be deletable
then it's probably going to find much more than 1 -- better to make
that cache miss take care of all of the TIDs together. Also seems like
the CPU could do some clever prefetching with this approach -- it
could prefetch TIDs where the initial chunk metadata is insufficient
to eliminate them early -- these are the groups of TIDs that will have
many TIDs that we actually need to delete. ISTM that improving
temporal locality through batching could matter a lot here.
--
Peter Geoghegan
On Wed, Jul 7, 2021 at 11:25 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> > [...] Overall, 'rtbm' has much better lookup performance and good
> > memory usage, especially when there are relatively many dead tuples.
>
> The currently available infrastructure for TIDs is quite ill-defined
> for TableAM authors [0], and other TableAMs might want to use more
> than just the 11 bits needed by heapam's max-BLCKSZ
> MaxHeapTuplesPerPage to identify tuples (MaxHeapTuplesPerPage is 1169
> at the maximum 32kB BLCKSZ, which requires 11 bits).
>
> Could you also check what the (performance, memory) impact would be if
> these proposed structures were to support the maximum
> MaxHeapTuplesPerPage of 1169 or the full uint16-range of offset
> numbers that could be supported by our current TID struct?
I think tbm will be the most affected by the memory impact of the
larger maximum MaxHeapTuplesPerPage. For example, with 32kB blocks
(MaxHeapTuplesPerPage = 1169), even if there is only one dead tuple in
a block, it will always require at least 147 bytes per block.
Rtbm chooses the container type among array, bitmap, or run depending
on the number and distribution of dead tuples in a block, and only
bitmap containers can be searched with O(1). Run containers depend on
the distribution of dead tuples within a block. So let’s compare array
and bitmap containers.
With 8kB blocks (MaxHeapTuplesPerPage = 291), at most 36 bytes are
needed for a bitmap container. In other words, compared to an array
container, a bitmap will be chosen if there are more than 18 dead
tuples in a block. On the other hand, with 32kB blocks
(MaxHeapTuplesPerPage = 1169), at most 147 bytes are needed for a
bitmap container, so a bitmap container will be chosen if there are
more than 74 dead tuples in a block. And with the full uint16 range
(MaxHeapTuplesPerPage = 65535), at most 8192 bytes are needed, so a
bitmap container will be chosen if there are more than 4096 dead
tuples in a block. Therefore, in any case, if more than about 6% of
the tuples in a block are garbage, a bitmap container will be chosen,
bringing faster lookup performance. (Of course, if a run container is
chosen, the container size gets smaller, but the lookup performance is
O(logN).) But if the number of dead tuples in the table is small and
we have a larger MaxHeapTuplesPerPage, an array container is likely to
be chosen, and the lookup performance becomes O(logN). Still, it
should be faster than the array data structure because the range of
search targets in an array container is much smaller.
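In code form, the container choice boils down to something like this
(my own illustration; rounding details glossed over):

static bool
choose_bitmap_container(int ndead)
{
    int     bitmap_bytes = (MaxHeapTuplesPerPage + 7) / 8;

    /* An array container costs 2 bytes per dead offset, so the
     * crossover is at ndead ~= MaxHeapTuplesPerPage / 16, i.e. about
     * 6% of the offsets, regardless of block size. */
    return ndead * 2 > bitmap_bytes;
}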
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Jul 8, 2021 at 5:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I think that the main problem with the 1GB limitation is that it is
> surprising -- it can cause disruption when we first exceed the magical
> limit of ~174 million TIDs. [...]
>
> My point is that we should be careful to choose the correct goal.
> Obviously memory use matters. But it might be more helpful to think of
> memory use as just a proxy for what truly matters, not a goal in
> itself. It's hard to know what this means (what is the "real goal"?),
> and hard to measure it even if you know for sure. It could still be
> useful to think of it like this.
As I wrote in the first email, I think there are two important
factors in index vacuuming performance: the speed of checking whether
the heap TID an index tuple points to is dead, and the number of index
bulk-deletion passes. The flame graph I attached in the first mail
shows the CPU spending much time in lazy_tid_reaped(), but vacuum is a
disk-intensive operation in practice. Given that most index AMs'
bulk-deletion does a full index scan and a table can have multiple
indexes, reducing the number of index bulk-deletion passes really
contributes to reducing the execution time, especially for large
tables. I think a more compact data structure for dead tuple TIDs is
one way to achieve that.
> This seems very promising.
>
> I wonder how much you have thought about the index AM side. It makes
> sense to initially evaluate these techniques using this approach of
> separating the data structure from how it is used by VACUUM -- I think
> that that was a good idea. But at the same time there may be certain
> important theoretical questions that cannot be answered this way --
> questions about how everything "fits together" in a real VACUUM might
> matter a lot. You've probably thought about this at least a little
> already. Curious to hear how you think it "fits together" with the
> work that you've done already.
Yeah, that definitely needs to be considered. Currently, what we need
from the dead tuple storage for lazy vacuum is store, lookup, and
iteration. And given parallel vacuum, it has to be allocatable in DSM
or DSA. While implementing the PoC code, I've been trying to integrate
it with the current lazy vacuum code. As far as I've seen so far, the
integration is not hard, at least with the *current* lazy vacuum code
and index AM code.
> [...]
>
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.
>
> [...] For example, it might
> actually be okay to sort all of the TIDs on the page to make the bulk
> processing work -- though you might still do a precheck that is
> similar to the precheck inside lazy_tid_reaped() that was added by you
> in commit bbaf315309e.
Interesting idea. I remember you mentioned this idea somewhere, and I
considered it too while implementing the PoC code. It's definitely
worth trying. Maybe we can work on it as a separate patch? It would
change the index AM interface and could also improve the current
bulk-deletion. We can consider a better data structure on top of this
idea.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Very nice results.
I have been working on the same problem but with a bit different
solution - a mix of binary search for (sub)pages and 32-bit bitmaps
for tid-in-page.
Even with the current allocation heuristics (allocate 291 tids per
page), it initially allocates much less space: instead of the current
291*6 = 1746 bytes per page, it needs to allocate only 80 bytes.
Also it can be laid out so that it is friendly to parallel SIMD
searches doing up to 8 tid lookups in parallel.
That said, for allocating the tid array, the best solution is to
postpone it as much as possible and to do the initial collection into
a file, which
1) postpones the memory allocation to the beginning of index cleanups
2) lets you select the correct size and structure as you know more
about the distribution at that time
3) lets you do the first heap pass in one go and then advance
frozenxmin *before* index cleanup
Also, collecting dead tids into a file makes it trivial (well, almost
:) ) to parallelize the initial heap scan, so more resources can be
thrown at it if available.
Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
Resending as I forgot to send it to the list (thanks Peter :) )
On Wed, Jul 7, 2021 at 10:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The loop inside btvacuumpage() makes each loop iteration call the
> callback -- this is always a call to lazy_tid_reaped() in practice.
> And that's where we do binary searches. These binary searches are
> usually where we see a huge number of cycles spent when we look at
> profiles, including the profile that produced your flame graph. But I
> worry that that might be a bit misleading -- the way that profilers
> attribute costs is very complicated and can never be fully trusted.
> While it is true that lazy_tid_reaped() often accesses main memory,
> which will of course add a huge amount of latency and make it a huge
> bottleneck, the "big picture" is still relevant.
This is why I have mainly focused on making it possible to use SIMD
and run 4-8 binary searches in parallel, mostly 8, for AVX2.
How I am approaching this is separating the "page search" to run over
a (naturally) sorted array of 32-bit page pointers, and only when the
page is found are the indexes in this array used to look up the
in-page bitmaps.
This allows the heavier bsearch activity to run over a smaller range
of memory, hopefully reducing the cache thrashing.
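To illustrate my reading of that layout (names invented here, and
simplified to one flat bitmap per page):

typedef struct DeadTupleIndex
{
    uint32     *blknos;         /* sorted, densely packed block numbers */
    uint32     *bitmaps;        /* 10 x 32-bit words per block, parallel
                                 * to blknos (covers offsets 1..291) */
    int         nblocks;
} DeadTupleIndex;

static bool
dti_lookup(DeadTupleIndex *dti, BlockNumber blkno, OffsetNumber off)
{
    int         lo = 0,
                hi = dti->nblocks - 1;

    /* the bsearch touches only the packed 4-byte keys, which is
     * cache-friendlier than bsearch over 6-byte ItemPointerData */
    while (lo <= hi)
    {
        int         mid = (lo + hi) / 2;

        if (dti->blknos[mid] == blkno)
        {
            const uint32 *bm = dti->bitmaps + mid * 10;

            return (bm[(off - 1) / 32] &
                    ((uint32) 1 << ((off - 1) % 32))) != 0;
        }
        if (dti->blknos[mid] < blkno)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}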
There are opportunities to optimise this further for cache hits, by
collecting the tids from indexes in larger batches and then
constraining the searches in the main is-deleted bitmap to run over
sections of it, but at some point this becomes a very complex
balancing act, as the manipulation of the bits-to-check from indexes
also takes time, not to mention the need to release the index pages
and then later chase the tid pointers in case they have moved while
checking them.
I have not measured anything yet, but one of my concerns is that
searching very large dead tuple collections with an 8-way parallel
bsearch could actually get close to saturating RAM bandwidth by
reading (8 x 32 bits x cache-line-size) bytes from main memory every
few cycles, so we may need some inner-loop-level throttling similar to
the current vacuum_cost_limit for data pages.
> I think that the compiler currently has to make very conservative
> assumptions when generating the machine code used by the loop inside
> btvacuumpage(), which calls through an opaque function pointer at
> least once per loop iteration -- anything can alias, so the compiler
> must be conservative.
Definitely this! The lookup function needs to be turned into an
inline function or #define as well, to give the compiler maximum
freedom.
> The data dependencies are hard for both the
> compiler and the CPU to analyze. The cost of using a function pointer
> compared to a direct function call is usually quite low, but there are
> important exceptions -- cases where it prevents other useful
> optimizations. Maybe this is an exception.
Yes. Also this could be a place where unrolling the loop could make a
real difference.
Maybe not unrolling the full 32 iterations of a 32-bit bsearch, but
something like an 8-iteration unroll to get most of the benefit.
The full 32x unroll would not be that bad for performance if all 32
iterations were needed, but mostly we would need to jump into the last
10 to 20 iterations (for lookups over roughly 1,000 to 1,000,000
pages), and I suspect this is such a weird corner case that the
compiler is really unlikely to support this optimisation. Of course I
may be wrong and it is a common enough case for the optimiser.
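For reference, the usual trick for making such a search unroll-friendly
is a branchless form whose iteration count depends only on the array
size, something like (a generic sketch, not measured):

/* Branchless binary search over n sorted uint32 keys; the loop trip
 * count depends only on n, so the compiler can unroll it. */
static int
branchless_search(const uint32 *keys, int n, uint32 target)
{
    const uint32 *base = keys;

    if (n <= 0)
        return -1;
    while (n > 1)
    {
        int         half = n / 2;

        /* compiles to a conditional move rather than a branch */
        base = (base[half] <= target) ? base + half : base;
        n -= half;
    }
    return (*base == target) ? (int) (base - keys) : -1;
}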
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.
While it may make sense to have different bitmap encodings for
different distributions, it likely would not be good for optimisations
if all of these are used at the same time.
This is why I propose that the first bitmap collecting phase collect
into a file, and then - when reading it into memory for the lookup
phase - possibly rewrite the initial structure into something else if
that would be more efficient, for example when the first half of the
file consists of only empty pages.
> This approach would make btbulkdelete() similar to
> _bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
> an idea independent of yours -- I imagine that this would work
> far better when combined with a more compact data structure, which is
> naturally more capable of batch processing than a simple array of
> TIDs. Maybe this will help the compiler and the CPU to fully
> understand the *natural* data dependencies, so that they can be as
> effective as possible in making the code run fast. It's possible that
> a modern CPU will be able to *hide* the latency more intelligently
> than what we have today. The latency is such a big problem that we may
> be able to justify "wasting" other CPU resources, just because it
> sometimes helps with hiding the latency. For example, it might
> actually be okay to sort all of the TIDs on the page to make the bulk
> processing work
Then again, it may be so much extra work that it starts to dominate
some parts of profiles.
For example, see the work that was done in improving the mini-vacuum
part, where it was actually faster to copy data out to a separate
buffer and then back in than to shuffle it around inside the same 8kB
page :)
So only testing will tell.
> -- though you might still do a precheck that is
> similar to the precheck inside lazy_tid_reaped() that was added by you
> in commit bbaf315309e.
>
> Of course it's very easy to be wrong about stuff like this. But it
> might not be that hard to prototype. You can literally copy and paste
> code from _bt_delitems_delete_check() to do this. It does the same
> basic thing already.
Also, a lot of testing would be needed to figure out which strategy
fits best for which distribution of dead tuples, and possibly their
relation to the order of the tuples to check from indexes.
Cheers
--
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
On Thu, Jul 8, 2021 at 1:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> As I wrote in the first email, I think there are two important
> factors in index vacuuming performance: the speed of checking whether
> the heap TID an index tuple points to is dead, and the number of index
> bulk-deletion passes. The flame graph I attached in the first mail
> shows the CPU spending much time in lazy_tid_reaped(), but vacuum is a
> disk-intensive operation in practice.
Maybe. But I recently bought an NVME SSD that can read at over
6GB/second. So "disk-intensive" is not what it used to be -- at least
not for reads. In general it's not good if we do multiple scans of an
index -- no question. But there is a danger in paying a little too
much attention to what is true in general -- we should not ignore what
might be true in specific cases either. Maybe we can solve some
problems by spilling the TID data structure to disk -- if we trade
sequential I/O for random I/O, we may be able to do only one pass over
the index (especially when we have *almost* enough memory to fit all
TIDs, but not quite enough).
The big problem with multiple passes over the index is not the extra
read bandwidth -- it's the extra page dirtying (writes), especially
with things like indexes on UUID columns. We want to dirty each leaf
page in each index at most once per VACUUM, and should be willing to
pay some cost in order to get a larger benefit with page dirtying.
After all, writes are much more expensive on modern flash devices --
if we have to do more random read I/O to spill the TIDs then that
might actually be 100% worth it. And, we don't need much memory for
something that works well as a negative cache, either -- so maybe the
extra random read I/O needed to spill the TIDs will be very limited
anyway.
There are many possibilities. You can probably think of other
trade-offs yourself. We could maybe use a cost model for all this --
it is a little like a hash join IMV. This is just something to think
about while refining the design.
> Interesting idea. I remember you mentioned this idea somewhere, and I
> considered it too while implementing the PoC code. It's definitely
> worth trying. Maybe we can work on it as a separate patch? It would
> change the index AM interface and could also improve the current
> bulk-deletion. We can consider a better data structure on top of this
> idea.
I'm happy to write it as a separate patch, either by leaving it to you
or by collaborating directly. It's not necessary to tie it to the
first patch. But at the same time it is highly related to what you're
already doing.
As I said I am totally prepared to be wrong here. But it seems worth
it to try. In Postgres 14, the _bt_delitems_vacuum() function (which
actually carries out VACUUM's physical page modifications to a leaf
page) is almost identical to _bt_delitems_delete(). And
_bt_delitems_delete() was already built with these kinds of problems
in mind -- it batches work to get the most out of synchronizing with
distant state describing which tuples to delete. It's not exactly the
same situation, but it's *kinda* similar. More importantly, it's a
relatively cheap and easy experiment to run, since we already have
most of what we need (we can take it from
_bt_delitems_delete_check()).
Usually this kind of micro optimization is not very valuable -- 99.9%+
of all code just isn't that sensitive to having the right
optimizations. But this is one of the rare important cases where we
really should look at the raw machine code, and do some kind of
microarchitectural level analysis through careful profiling, using
tools like perf. The laws of physics (or electronic engineering) make
it inevitable that searching for TIDs to match is going to be kind of
slow. But we should at least make sure that we use every trick
available to us to reduce the bottleneck, since it really does matter
a lot to users. Users should be able to expect that this code will at
least be as fast as the hardware that they paid for can allow (or
close to it). There is a great deal of microarchitectural
sophistication with modern CPUs, much of which is designed to make
problems like this one less bad [1].
[1]: https://www.agner.org/optimize/microarchitecture.pdf
--
Peter Geoghegan
On Thu, Jul 8, 2021 at 1:53 PM Hannu Krosing <hannuk@google.com> wrote:
> How I am approaching this is separating the "page search" to run over
> a (naturally) sorted array of 32-bit page pointers, and only when the
> page is found are the indexes in this array used to look up the
> in-page bitmaps.
>
> This allows the heavier bsearch activity to run over a smaller range
> of memory, hopefully reducing the cache thrashing.
I think that the really important thing is to figure out roughly the
right data structure first.
> There are opportunities to optimise this further for cache hits, by
> collecting the tids from indexes in larger batches and then
> constraining the searches in the main is-deleted bitmap to run over
> sections of it, but at some point this becomes a very complex
> balancing act, as the manipulation of the bits-to-check from indexes
> also takes time, not to mention the need to release the index pages
> and then later chase the tid pointers in case they have moved while
> checking them.
I would say that 200 TIDs per leaf page is common and ~1350 TIDs per
leaf page is not uncommon (with deduplication). Seems like that might
be enough?
> I have not measured anything yet, but one of my concerns is that
> searching very large dead tuple collections with an 8-way parallel
> bsearch could actually get close to saturating RAM bandwidth by
> reading (8 x 32 bits x cache-line-size) bytes from main memory every
> few cycles, so we may need some inner-loop-level throttling similar to
> the current vacuum_cost_limit for data pages.
If it happens then it'll be a nice problem to have, I suppose.
> Maybe not unrolling the full 32 iterations of a 32-bit bsearch, but
> something like an 8-iteration unroll to get most of the benefit.
My current assumption is that we're bound by memory speed right now,
and that that is the big bottleneck to eliminate -- we must keep the
CPU busy with data to process first. That seems like the most
promising thing to focus on right now.
> While it may make sense to have different bitmap encodings for
> different distributions, it likely would not be good for optimisations
> if all these are used at the same time.
To some degree designs like Roaring bitmaps are just that -- a way of
dynamically figuring out which strategy to use based on data
characteristics.
> This is why I propose the first bitmap collecting phase to collect
> into a file and then - when reading into memory for the lookup phase -
> possibly rewrite the initial structure to something else if it sees
> that it is more efficient. Like for example where the first half of
> the file consists of only empty pages.
Yeah, I agree that something like that could make sense. Although
rewriting it doesn't seem particularly promising, since we can easily
make it cheap to process any TID that falls into a range of blocks
that have no dead tuples. We don't need to rewrite the data structure
to make it do that well, AFAICT.
When I said that I thought of this a little like a hash join, I was
being more serious than you might imagine. Note that the number of
index tuples that VACUUM will delete from each index can now be far
less than the total number of TIDs stored in memory. So even when we
have (say) 20% of all of the TIDs from the table in our in memory list
managed by vacuumlazy.c, it's now quite possible that VACUUM will only
actually "match"/"join" (i.e. delete) as few as 2% of the index tuples
it finds in the index (there really is no way to predict how many).
The opportunistic deletion stuff could easily be doing most of the
required cleanup in an eager fashion following recent improvements --
VACUUM need only take care of "floating garbage" these days. In other
words, thinking about this as something that is a little bit like a
hash join makes sense because hash joins do very well with high join
selectivity, and high join selectivity is common in the real world.
The intersection of TIDs from each leaf page with the in-memory TID
delete structure will often be very small indeed.
> Then again it may be so much extra work that it starts to dominate
> some parts of profiles.
>
> For example see the work that was done in improving the mini-vacuum
> part where it was actually faster to copy data out to a separate
> buffer and then back in than shuffle it around inside the same 8k page
Some of what I'm saying is based on the experience of improving
similar code used by index tuple deletion in Postgres 14. That did
quite a lot of sorting of TIDs and things like that. In the end the
sorting had no more than a negligible impact on performance. What
really mattered was that we efficiently coordinate with distant heap
pages that describe which index tuples we can delete from a given leaf
page. Sorting hundreds of TIDs is cheap. Reading hundreds of random
locations in memory (or even far fewer) is not so cheap. It might even
be very slow indeed. Sorting in order to batch could end up looking
like cheap insurance that we should be glad to pay for.
> So only testing will tell.
True.
--
Peter Geoghegan
On Fri, Jul 9, 2021 at 12:34 AM Peter Geoghegan <pg@bowt.ie> wrote:
> [...]
>
> I would say that 200 TIDs per leaf page is common and ~1350 TIDs per
> leaf page is not uncommon (with deduplication). Seems like that might
> be enough?
Likely yes, and it would also have the nice property of not changing
the index page locking behaviour.
Are deduplicated tids in a leaf page already sorted in heap order?
That could potentially simplify / speed up the sort.
> > I have not measured anything yet, but one of my concerns is that
> > searching very large dead tuple collections with an 8-way parallel
> > bsearch could actually get close to saturating RAM bandwidth [...]
>
> If it happens then it'll be a nice problem to have, I suppose.
>
> > Maybe not unrolling the full 32 iterations of a 32-bit bsearch, but
> > something like an 8-iteration unroll to get most of the benefit.
>
> My current assumption is that we're bound by memory speed right now,
Most likely yes, and this should also be easy to check by manually
unrolling perhaps 4 iterations and measuring any speed increase.
> and that that is the big bottleneck to eliminate -- we must keep the
> CPU busy with data to process first. That seems like the most
> promising thing to focus on right now.
This actually has two parts:
- trying to make sure that we can serve as much as possible from cache
- if we need to go out of cache, then trying to parallelise this as
much as possible
At the same time we need to watch that we are not making the index
tuple preparation work so heavy that it starts to dominate over the
memory accesses.
> > While it may make sense to have different bitmap encodings for
> > different distributions, it likely would not be good for optimisations
> > if all these are used at the same time.
>
> To some degree designs like Roaring bitmaps are just that -- a way of
> dynamically figuring out which strategy to use based on data
> characteristics.
It is, but as I am keeping one eye open for vectorisation, I don't
like it when different parts of the same bitmap have radically
different encoding strategies.
> > This is why I propose the first bitmap collecting phase to collect
> > into a file and then - when reading into memory for the lookup phase -
> > possibly rewrite the initial structure to something else if it sees
> > that it is more efficient. Like for example where the first half of
> > the file consists of only empty pages.
>
> Yeah, I agree that something like that could make sense. Although
> rewriting it doesn't seem particularly promising,
yeah, I hope to prove (or verify :) ) the structure is good enough so
that it does not need the rewrite.
> since we can easily
> make it cheap to process any TID that falls into a range of blocks
> that have no dead tuples.
I actually meant the opposite case, where we could replace a full
80-byte, 291-bit "all dead" bitmap with just a range - an int4 for the
page and two int2-s for the min and max tid-in-page - for an extra 10x
reduction, on top of the original 21x reduction from the current
6-bytes-per-tid encoding to my page_bsearch_vector bitmaps, which
encode one page in a maximum of 80 bytes (5 x int4 sub-page pointers +
5 x int4 bitmaps).
I also started out by investigating RoaringBitmaps, but when I
realized that we would likely have to rewrite it anyway, I continued
working on getting to a single uniform encoding which fits most use
cases Good Enough, and then used that uniformity to let the compiler
do its optimisation and hopefully also vectorisation magic.
> We don't need to rewrite the data structure
> to make it do that well, AFAICT.
>
> When I said that I thought of this a little like a hash join, I was
> being more serious than you might imagine. Note that the number of
> index tuples that VACUUM will delete from each index can now be far
> less than the total number of TIDs stored in memory. So even when we
> have (say) 20% of all of the TIDs from the table in our in memory list
> managed by vacuumlazy.c, it's now quite possible that VACUUM will only
> actually "match"/"join" (i.e. delete) as few as 2% of the index tuples
> it finds in the index (there really is no way to predict how many).
> The opportunistic deletion stuff could easily be doing most of the
> required cleanup in an eager fashion following recent improvements --
> VACUUM need only take care of "floating garbage" these days.
Ok, this points to the need to mainly optimise for a quite sparse
population of dead tuples, which is still mainly clustered page-wise?
> In other
> words, thinking about this as something that is a little bit like a
> hash join makes sense because hash joins do very well with high join
> selectivity, and high join selectivity is common in the real world.
> The intersection of TIDs from each leaf page with the in-memory TID
> delete structure will often be very small indeed.
The hard-to-optimize case is still when we have dead tuple counts in the
hundreds of millions, or even billions, like on an HTAP database after
a few hours of OLAP queries have accumulated loads of dead tuples in
tables getting heavy OLTP traffic.
There of course we could do a totally different optimisation, where we
also allow reaping tuples newer than the OLAP query's snapshot if we
can prove that when the snapshot moves forward next time, it has to
jump over said transactions making them indeed DEAD and not RECENTLY
DEAD. Currently we let a single OLAP query ruin everything :)
Then again it may be so much extra work that it starts to dominate
some parts of profiles.
For example see the work that was done in improving the mini-vacuum
part, where it was actually faster to copy data out to a separate
buffer and then back in than to shuffle it around inside the same 8kB
page.
Some of what I'm saying is based on the experience of improving
similar code used by index tuple deletion in Postgres 14. That did
quite a lot of sorting of TIDs and things like that. In the end the
sorting had no more than a negligible impact on performance.
Good to know :)
What
really mattered was that we efficiently coordinate with distant heap
pages that describe which index tuples we can delete from a given leaf
page. Sorting hundreds of TIDs is cheap. Reading hundreds of random
locations in memory (or even far fewer) is not so cheap. It might even
be very slow indeed. Sorting in order to batch could end up looking
like cheap insurance that we should be glad to pay for.
If the most expensive operation is sorting a few hundred tids, then
this should be fast enough.
My worries were more that after the sorting we cannot do simple
index lookups for them; each needs to be found via bsearch, or maybe
even a plain linear search if that is faster under some size limit, and
these could add up. Or some other needed thing that also has to be
done, like allocating extra memory or moving other data around in a
way that the CPU does not like.
Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
The big problem with bsearch isn't imo the O(log(n)) complexity - it's
that it has abominably bad cache locality. And that can be addressed:
https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
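For instance, a minimal sketch of what I mean; the function names here
are made up, and encode_tid() just mirrors what itemptr_encode() does:

#include "postgres.h"
#include "storage/itemptr.h"

/* Pack a TID into one int64, like itemptr_encode(): block number in the
 * upper bits, offset in the low 16 bits.  Comparisons then become plain
 * integer compares instead of vac_cmp_itemptr() calls. */
static inline int64
encode_tid(ItemPointer tid)
{
	return ((int64) ItemPointerGetBlockNumber(tid) << 16) |
		ItemPointerGetOffsetNumber(tid);
}

/* Open-coded binary search over the sorted encoded array.  The inlined
 * comparator and branch-light loop avoid most of the misprediction cost
 * of a generic bsearch(). */
static bool
encoded_tid_exists(const int64 *keys, size_t n, int64 key)
{
	const int64 *base = keys;

	while (n > 1)
	{
		size_t		half = n / 2;

		/* tends to compile to a conditional move, not a branch */
		base = (base[half] <= key) ? base + half : base;
		n -= half;
	}
	return n == 1 && *base == key;
}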
Imo 2) isn't really that hard a problem to improve, even if we were to
stay with the current bsearch approach. Reallocation with an aggressive
growth factor or such isn't that bad.
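Something along these lines, say (a sketch only; DeadTupleArray is a
hypothetical stand-in for the dead-tuple array in vacuumlazy.c):

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct DeadTupleArray
{
	ItemPointerData *itemptrs;	/* appended to during the heap scan */
	int64		num_tuples;
	int64		max_tuples;
} DeadTupleArray;

/* Grow geometrically instead of allocating the maintenance_work_mem
 * bound up front; repalloc_huge() lifts the 1GB allocation limit. */
static void
dead_tuples_append(DeadTupleArray *dt, ItemPointer tid)
{
	if (dt->num_tuples >= dt->max_tuples)
	{
		dt->max_tuples *= 2;	/* aggressive growth factor */
		dt->itemptrs = (ItemPointerData *)
			repalloc_huge(dt->itemptrs,
						  dt->max_tuples * sizeof(ItemPointerData));
	}
	dt->itemptrs[dt->num_tuples++] = *tid;
}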
That's not to say we ought to stay with binary search...
Problems Solutions
===============
Firstly, I've considered using existing data structures:
IntegerSet(src/backend/lib/integerset.c) and
TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but
only either point 2 or 3. IntegerSet uses lower memory thanks to
simple-8b encoding but is slow at lookup, still O(logN), since it’s a
tree structure. On the other hand, TIDBitmap has a good lookup
performance, O(1), but could unnecessarily use larger memory in some
cases since it always allocates the space for bitmap enough to store
all possible offsets. With 8kB blocks, the maximum number of line
pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the
bitmap is 40 bytes long and we always need 46 bytes in total per block
including other meta information.
Imo tidbitmap isn't particularly good, even in the current use cases -
it's constraining in what we can store (a problem for other AMs), not
actually that dense, the lossy mode doesn't choose what information to
lose well, etc.
It'd be nice if we came up with a datastructure that could also replace
the bitmap scan cases.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
Not a huge fan of encoding this much knowledge about the tid layout...
For example, if there are two dead tuples at offset 1 and 150, it uses
the array container that has an array of two 2-byte integers
representing 1 and 150, using 4 bytes in total. If we used the bitmap
container in this case, we would need 20 bytes instead. On the other
hand, if there are consecutive 20 dead tuples from offset 1 to 20, it
uses the run container that has an array of 2-byte integers. The first
value in each pair represents a starting offset number, whereas the
second value represents its length. Therefore, in this case, the run
container uses only 4 bytes in total. Finally, if there are dead
tuples at every other offset from 1 to 100, it uses the bitmap
container that has an uncompressed bitmap, using 13 bytes. We need
another 16 bytes per block entry for the hash table entry.
The lookup complexity of a bitmap container is O(1) whereas that of
an array or a run container is O(N) or O(logN), but since the number of
elements in those two containers should not be large, it would not be a
problem.
Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which
many tuples have been deleted. In cases where the "run" storage is
cheaper (e.g. because there's high offset numbers due to HOT pruning),
we could end up regularly scanning a few hundred entries for a
match. That's not cheap anymore.
Evaluation
========
Before implementing this idea and integrating it with lazy vacuum
code, I've implemented a benchmark tool dedicated to evaluating
lazy_tid_reaped() performance[4].
Good idea!
In all test cases, I simulated that the table has 1,000,000 blocks and
every block has at least one dead tuple.
That doesn't strike me as a particularly common scenario? I think it's
quite rare for dead tuples to be distributed so evenly yet sparsely. In
particular it's very common for there to be long runs of dead tuples
separated by long ranges with no dead tuples at all...
The benchmark scenario is that for
each virtual heap tuple we check whether its TID exists in the dead
tuple storage. Here are the results of execution time in milliseconds
and memory usage in bytes:
In which order are the dead tuples checked? Looks like in sequential
order? In the case of an index over a column that's not correlated with
the heap order the lookups are often much more random - which can
influence lookup performance drastically, due to differences in
cache locality. Which will make some structures look worse/better than
others.
Greetings,
Andres Freund
Hi,
On 2021-07-08 20:53:32 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
Just using itemptr_encode() makes array in test #1 go from 8s to 6.5s on my
machine.
Another thing I just noticed is that you didn't include the build times for the
datastructures. They are lower than the lookups currently, but it does seem
like a relevant thing to measure as well. E.g. for #1 I see the following build
times
array 24.943 ms
tbm 206.456 ms
intset 93.575 ms
vtbm 134.315 ms
rtbm 145.964 ms
that's a significant range...
Randomizing the lookup order (using a random shuffle in
generate_index_tuples()) changes the benchmark results for #1 significantly:
shuffled time unshuffled time
array 6551.726 ms 6478.554 ms
intset 67590.879 ms 10815.810 ms
rtbm 17992.487 ms 2518.492 ms
tbm 364.917 ms 360.128 ms
vtbm 12227.884 ms 1288.123 ms
FWIW, I get an assertion failure when using an assertion build:
#2 0x0000561800ea02e0 in ExceptionalCondition (conditionName=0x7f9115a88e91 "found", errorType=0x7f9115a88d11 "FailedAssertion",
fileName=0x7f9115a88e8a "rtbm.c", lineNumber=242) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
#3 0x00007f9115a87645 in rtbm_add_tuples (rtbm=0x561806293280, blkno=0, offnums=0x7fffdccabb00, nitems=10) at rtbm.c:242
#4 0x00007f9115a8363d in load_rtbm (rtbm=0x561806293280, itemptrs=0x7f908a203050, nitems=10000000) at bdbench.c:618
#5 0x00007f9115a834b9 in rtbm_attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:587
#6 0x00007f9115a83837 in attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:658
#7 0x00007f9115a84190 in attach_dead_tuples (fcinfo=0x56180322d690) at bdbench.c:873
I assume you just inverted the Assert(found) assertion?
Greetings,
Andres Freund
On Fri, Jul 9, 2021 at 12:53 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
The big problem with bsearch isn't imo the O(log(n)) complexity - it's
that it has abominably bad cache locality. And that can be addressed:
https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
Imo 2) isn't really that hard a problem to improve, even if we were to
stay with the current bsearch approach. Reallocation with an aggressive
growth factor or such isn't that bad.
That's not to say we ought to stay with binary search...
Problems Solutions
===============
Firstly, I've considered using existing data structures:
IntegerSet(src/backend/lib/integerset.c) and
TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but
only either point 2 or 3. IntegerSet uses lower memory thanks to
simple-8b encoding but is slow at lookup, still O(logN), since it’s a
tree structure. On the other hand, TIDBitmap has a good lookup
performance, O(1), but could unnecessarily use larger memory in some
cases since it always allocates the space for bitmap enough to store
all possible offsets. With 8kB blocks, the maximum number of line
pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the
bitmap is 40 bytes long and we always need 46 bytes in total per block
including other meta information.
Imo tidbitmap isn't particularly good, even in the current use cases -
it's constraining in what we can store (a problem for other AMs), not
actually that dense, the lossy mode doesn't choose what information to
lose well, etc.
It'd be nice if we came up with a datastructure that could also replace
the bitmap scan cases.
Agreed.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
Not a huge fan of encoding this much knowledge about the tid layout...
For example, if there are two dead tuples at offset 1 and 150, it uses
the array container that has an array of two 2-byte integers
representing 1 and 150, using 4 bytes in total. If we used the bitmap
container in this case, we would need 20 bytes instead. On the other
hand, if there are consecutive 20 dead tuples from offset 1 to 20, it
uses the run container that has an array of 2-byte integers. The first
value in each pair represents a starting offset number, whereas the
second value represents its length. Therefore, in this case, the run
container uses only 4 bytes in total. Finally, if there are dead
tuples at every other offset from 1 to 100, it uses the bitmap
container that has an uncompressed bitmap, using 13 bytes. We need
another 16 bytes per block entry for the hash table entry.
The lookup complexity of a bitmap container is O(1) whereas that of
an array or a run container is O(N) or O(logN), but since the number of
elements in those two containers should not be large, it would not be a
problem.
Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which
many tuples have been deleted. In cases where the "run" storage is
cheaper (e.g. because there's high offset numbers due to HOT pruning),
we could end up regularly scanning a few hundred entries for a
match. That's not cheap anymore.
With 8kB blocks, the maximum size of a bitmap container is 37 bytes.
IOW, other two types of containers are always smaller than 37 bytes.
Since the run container uses 4 bytes per run, the number of runs in a
run container can never be more than 9. Even with 32kB blocks, we don't
have more than 37 runs. So I think N is small enough in this case.
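For reference, a lookup in the run container is just a short linear scan
over sorted (start, length) pairs; a rough sketch (names are mine, not
from the patch):

#include "postgres.h"
#include "storage/off.h"

/* runs[] holds nruns pairs: runs[2*i] is a starting offset number and
 * runs[2*i + 1] the run length.  With at most 9 runs per container on
 * 8kB blocks, the scan touches only a handful of entries at worst. */
static bool
run_container_contains(const OffsetNumber *runs, int nruns,
					   OffsetNumber off)
{
	for (int i = 0; i < nruns; i++)
	{
		OffsetNumber start = runs[2 * i];
		OffsetNumber len = runs[2 * i + 1];

		if (off < start)
			return false;		/* runs are sorted; no later run matches */
		if (off < start + len)
			return true;
	}
	return false;
}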
Evaluation
========
Before implementing this idea and integrating it with lazy vacuum
code, I've implemented a benchmark tool dedicated to evaluating
lazy_tid_reaped() performance[4].
Good idea!
In all test cases, I simulated that the table has 1,000,000 blocks and
every block has at least one dead tuple.
That doesn't strike me as a particularly common scenario? I think it's
quite rare for dead tuples to be distributed so evenly yet sparsely. In
particular it's very common for there to be long runs of dead tuples
separated by long ranges with no dead tuples at all...
Agreed. I'll test with such scenarios.
The benchmark scenario is that for
each virtual heap tuple we check whether its TID exists in the dead
tuple storage. Here are the results of execution time in milliseconds
and memory usage in bytes:
In which order are the dead tuples checked? Looks like in sequential
order? In the case of an index over a column that's not correlated with
the heap order the lookups are often much more random - which can
influence lookup performance drastically, due to differences in
cache locality. Which will make some structures look worse/better than
others.
Good point. It's sequential order, which is not good. I'll test again
after shuffling virtual index tuples.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Fri, Jul 9, 2021 at 2:37 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-08 20:53:32 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
Just using itemptr_encode() makes array in test #1 go from 8s to 6.5s on my
machine.
Another thing I just noticed is that you didn't include the build times for the
datastructures. They are lower than the lookups currently, but it does seem
like a relevant thing to measure as well. E.g. for #1 I see the following build
times
array 24.943 ms
tbm 206.456 ms
intset 93.575 ms
vtbm 134.315 ms
rtbm 145.964 ms
that's a significant range...
Good point. I got similar results when measuring on my machine:
array 57.987 ms
tbm 297.720 ms
intset 113.796 ms
vtbm 165.268 ms
rtbm 199.658 ms
Randomizing the lookup order (using a random shuffle in
generate_index_tuples()) changes the benchmark results for #1 significantly:
shuffled time unshuffled time
array 6551.726 ms 6478.554 ms
intset 67590.879 ms 10815.810 ms
rtbm 17992.487 ms 2518.492 ms
tbm 364.917 ms 360.128 ms
vtbm 12227.884 ms 1288.123 ms
I believe that in your test, tbm_reaped() actually always returned
true. That could explain why tbm was very fast in both cases. Since
TIDBitmap in core doesn't support an existence check, tbm_reaped()
in bdbench.c always returns true. I added a patch to the repository to
add existence check support to TIDBitmap, although it assumes the bitmap
is never lossy.
That being said, I'm surprised that rtbm was slower than array in the
shuffled case. I've also measured the shuffled case and got
different results. To be clear, I used the prepare() SQL function to
prepare both virtual dead tuples and index tuples, loaded them with the
attach_dead_tuples() SQL function, and executed the bench() SQL function
for each data structure. Here are the results:
shuffled time unshuffled time
array 88899.513 ms 12616.521 ms
intset 73476.055 ms 10063.405 ms
rtbm 22264.671 ms 2073.171 ms
tbm 10285.092 ms 1417.312 ms
vtbm 14488.581 ms 1240.666 ms
FWIW, I get an assertion failure when using an assertion build:
#2 0x0000561800ea02e0 in ExceptionalCondition (conditionName=0x7f9115a88e91 "found", errorType=0x7f9115a88d11 "FailedAssertion",
fileName=0x7f9115a88e8a "rtbm.c", lineNumber=242) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
#3 0x00007f9115a87645 in rtbm_add_tuples (rtbm=0x561806293280, blkno=0, offnums=0x7fffdccabb00, nitems=10) at rtbm.c:242
#4 0x00007f9115a8363d in load_rtbm (rtbm=0x561806293280, itemptrs=0x7f908a203050, nitems=10000000) at bdbench.c:618
#5 0x00007f9115a834b9 in rtbm_attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:587
#6 0x00007f9115a83837 in attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:658
#7 0x00007f9115a84190 in attach_dead_tuples (fcinfo=0x56180322d690) at bdbench.c:873I assume you just inverted the Assert(found) assertion?
Right. Fixed it.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Jul 8, 2021 at 7:51 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
Maybe for something like rtbm.c (which is inspired by Roaring
bitmaps), you would really want to use an "intersection" operation for
this. The TIDs that we need to physically delete from the leaf page
inside btvacuumpage() are the intersection of two bitmaps: our bitmap
of all TIDs on the leaf page, and our bitmap of all TIDs that need to
be deleted by the ongoing btbulkdelete() call.
Agreed. In such a batch operation, what we need to do here is to
compute the intersection of two bitmaps.
Obviously the typical case is that most TIDs in the index do *not* get
deleted -- needing to delete more than ~20% of all TIDs in the index
will be rare. Ideally it would be very cheap to figure out that a TID
does not need to be deleted at all. Something a little like a negative
cache (but not a true negative cache). This is a little bit like how
hash joins can be made faster by adding a Bloom filter -- most hash
probes don't need to join a tuple in the real world, and we can make
these hash probes even faster by using a Bloom filter as a negative
cache.
Agreed.
If you had the list of TIDs from a leaf page sorted for batch
processing, and if you had roaring bitmap style "chunks" with
"container" metadata stored in the data structure, you could then use
merging/intersection -- that has some of the same advantages. I think
that this would be a lot more efficient than having one binary search
per TID. Most TIDs from the leaf page can be skipped over very
quickly, in large groups. It's very rare for VACUUM to need to delete
TIDs from completely random heap table blocks in the real world (some
kind of pattern is much more common).
When this merging process finds 1 TID that might really be deletable
then it's probably going to find much more than 1 -- better to make
that cache miss take care of all of the TIDs together. Also seems like
the CPU could do some clever prefetching with this approach -- it
could prefetch TIDs where the initial chunk metadata is insufficient
to eliminate them early -- these are the groups of TIDs that will have
many TIDs that we actually need to delete. ISTM that improving
temporal locality through batching could matter a lot here.
That's a promising approach.
In rtbm, one pair of a hash entry and a container is used per
block. Therefore, if a block contains no dead tuples, we can skip the
leaf page's TIDs for that block just by checking the hash table. If the
hash entry exists, meaning the block has at least one dead tuple, we
look up the offset of each leaf-page TID in that block's container.
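In pseudo-C, the batch check could look roughly like this; the rtbm
structure and the container-probe helper are hypothetical names, not the
actual patch API:

#include "postgres.h"
#include "storage/itemptr.h"
#include "utils/hsearch.h"

/* Hypothetical helper: probe a block's array/bitmap/run container. */
extern bool rtbm_container_contains(void *entry, OffsetNumber off);

/* Count the leaf-page TIDs that are in the dead-tuple set: one dynahash
 * lookup per TID decides whether a container probe is needed at all, so
 * heap blocks without dead tuples are skipped immediately. */
static int
count_deletable(HTAB *block_table, ItemPointerData *leaftids, int ntids)
{
	int			ndeletable = 0;

	for (int i = 0; i < ntids; i++)
	{
		BlockNumber blkno = ItemPointerGetBlockNumber(&leaftids[i]);
		void	   *entry = hash_search(block_table, &blkno,
										HASH_FIND, NULL);

		if (entry == NULL)
			continue;			/* no dead tuples on this heap block */

		if (rtbm_container_contains(entry,
									ItemPointerGetOffsetNumber(&leaftids[i])))
			ndeletable++;
	}
	return ndeletable;
}

If the leaf-page TIDs are sorted first, the hash lookup could also be
reused across consecutive TIDs that fall on the same block.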
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Jul 8, 2021 at 10:40 PM Hannu Krosing <hannuk@google.com> wrote:
Very nice results.
I have been working on the same problem but a bit different solution -
a mix of binary search for (sub)pages and 32-bit bitmaps for
tid-in-page.
Even with current allocation heuristics (allocate 291 tids per page)
it initially allocates much less space: instead of the current 291*6=1746
bytes per page it needs to allocate only 80 bytes.
Also it can be laid out so that it is friendly to parallel SIMD
searches doing up to 8 tid lookups in parallel.
Interesting.
That said, for allocating the tid array, the best solution is to
postpone it as much as possible and to do the initial collection into
a file, which
1) postpones the memory allocation to the beginning of index cleanup
2) lets you select the correct size and structure, as you know more
about the distribution at that time
3) lets you do the first heap pass in one go and then advance frozenxmin
*before* index cleanup
I think we have to do index vacuuming before heap vacuuming (2nd heap
pass). So do you mean that it advances relfrozenxid of pg_class before
both index vacuuming and heap vacuuming?
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
Greetings,
Andres Freund
Hi,
On 2021-07-09 10:17:49 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
I experimented further, trying to use an old radix tree implementation I
had lying around to store dead tuples. With a bit of trickery that seems
to work well.
The radix tree implementation I have basically maps an int64 to another
int64. Each level of the radix tree stores 6 bits of the key, and uses
those 6 bits to index a 1<<6 (64-entry) array leading to the next level.
My first idea was to use itemptr_encode() to convert tids into an int64
and store the lower 6 bits in the value part of the radix tree. That
turned out to work well performance-wise, but awfully memory-wise. The
problem is that we use at most 9 bits for offsets, but reserve 16 bits
for them in the ItemPointerData. Which means that there's often a
lot of empty "tree levels" for those 0 bits, making it hard to get to a
decent memory usage.
The simplest way to address that was to simply compress out those
guaranteed-to-be-zero bits. That results in memory usage that's quite
good - nearly always beating array, occasionally beating rtbm. It's an
ordered datastructure, so the latter isn't too surprising. For lookup
performance the radix approach is commonly among the best, if not the
best.
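Concretely, the compressed encoding amounts to something like this (a
sketch; the 9-bit constant assumes 8kB pages, where MaxHeapTuplesPerPage
is 291 < 2^9):

#include "postgres.h"
#include "storage/itemptr.h"

#define TID_OFFSET_BITS 9		/* enough for offsets up to 511 */

/* Unlike itemptr_encode(), which leaves 7 always-zero bits between the
 * block number and the 16-bit offset field, this packs the key densely
 * so the radix tree has no nearly-empty levels. */
static inline uint64
tid_to_radix_key(ItemPointer tid)
{
	return ((uint64) ItemPointerGetBlockNumber(tid) << TID_OFFSET_BITS) |
		ItemPointerGetOffsetNumber(tid);
}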
A variation of the storage approach is to just use the block number as
the index, and store the tids as the value. Even with the absolutely
naive approach of just using a Bitmapset, that reduces memory usage
substantially - at a small cost to search performance. Of course it'd be
better to use an adaptive approach like you did for rtbm; I just thought
this is good enough.
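A sketch of that naive variant, with radix_lookup()/radix_insert()
standing in for the (not shown) radix tree API keyed by block number:

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "storage/itemptr.h"

/* Hypothetical radix tree API: maps a block number to a Bitmapset. */
extern Bitmapset *radix_lookup(BlockNumber blkno);
extern void radix_insert(BlockNumber blkno, Bitmapset *offsets);

static void
dead_tuple_add(ItemPointer tid)
{
	BlockNumber blkno = ItemPointerGetBlockNumber(tid);
	Bitmapset  *offsets = radix_lookup(blkno);

	/* bms_add_member() allocates or extends the set as needed */
	offsets = bms_add_member(offsets, ItemPointerGetOffsetNumber(tid));
	radix_insert(blkno, offsets);
}

static bool
dead_tuple_exists(ItemPointer tid)
{
	Bitmapset  *offsets = radix_lookup(ItemPointerGetBlockNumber(tid));

	return offsets != NULL &&
		bms_is_member(ItemPointerGetOffsetNumber(tid), offsets);
}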
This largely works well, except when there are a large number of evenly
spread out dead tuples. I don't think that's a particularly common
situation, but it's worth considering anyway.
The reason the memory usage can be larger for sparse workloads is that
sparseness obviously can lead to tree nodes with only one child. As nodes
are quite large (1<<6 pointers to further children), that then can lead
to a large increase in memory usage.
I have toyed with implementing adaptively large radix nodes like
proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
gotten it quite working.
Greetings,
Andres Freund
On Sat, Jul 10, 2021 at 2:17 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
Yes, I'm thinking that iteration over rtbm would be somewhat similar to
tbm. That is, we iterate over and collect the hash table entries, qsort
them by block number, and then fetch each entry along with its
container, one by one, in order of block number.
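A rough sketch of that, with RtbmEntry as a hypothetical stand-in for
the real hash entry type:

#include "postgres.h"
#include "storage/block.h"

typedef struct RtbmEntry
{
	BlockNumber blkno;
	/* ... reference to the block's container ... */
} RtbmEntry;

static int
rtbm_entry_cmp(const void *a, const void *b)
{
	BlockNumber ba = ((const RtbmEntry *) a)->blkno;
	BlockNumber bb = ((const RtbmEntry *) b)->blkno;

	return (ba > bb) - (ba < bb);
}

/* One-time sort when iteration begins; afterwards the second heap pass
 * is a plain array walk that visits heap blocks in ascending order. */
static void
rtbm_begin_iterate(RtbmEntry *entries, int nentries)
{
	qsort(entries, nentries, sizeof(RtbmEntry), rtbm_entry_cmp);
}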
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/