[PoC] Improve dead tuple storage for lazy vacuum
Hi all,
Index vacuuming is one of the most time-consuming processes in lazy
vacuuming, and lazy_tid_reaped() accounts for a large part of it. The
attached flame graph shows a profile of a vacuum on a table that has
one index, 80 million live rows, and 20 million dead rows, where
lazy_tid_reaped() accounts for about 47% of the total vacuum execution
time.
lazy_tid_reaped() is essentially an existence check: for every index
tuple, it checks whether the heap TID it points to exists in the set
of dead tuple TIDs. The memory for dead tuple TIDs is limited by
maintenance_work_mem, and if the upper limit is reached, the heap scan
is suspended, index vacuum and heap vacuum are performed, and then the
heap scan is resumed. Therefore, in terms of index vacuuming
performance, there are two important factors: the speed of looking up
TIDs in the set of dead tuples, and the set's memory usage. The former
is obvious, whereas the latter determines the number of index vacuum
passes. In many index AMs, index vacuuming (i.e., ambulkdelete)
performs a full scan of the index, so it is important for performance
to avoid executing index vacuuming more than once during a lazy
vacuum.
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
1. It cannot allocate more than 1GB. There was a discussion about
eliminating this limitation by using MemoryContextAllocHuge(), but
there were concerns about point 2 [1].
2. It allocates the whole memory space at once.
3. Lookup performance is slow (O(logN)).
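For reference, the current logic is roughly the following (a
simplified sketch of the existing vacuumlazy.c code; the real
lazy_tid_reaped() also has a fast-path range precheck, omitted here):

#include "postgres.h"
#include "storage/itemptr.h"

/* Simplified sketch of the current array-based dead tuple lookup */
static int
vac_cmp_itemptr(const void *left, const void *right)
{
    BlockNumber lblk = ItemPointerGetBlockNumber((ItemPointer) left);
    BlockNumber rblk = ItemPointerGetBlockNumber((ItemPointer) right);
    OffsetNumber loff,
                roff;

    if (lblk != rblk)
        return (lblk < rblk) ? -1 : 1;

    loff = ItemPointerGetOffsetNumber((ItemPointer) left);
    roff = ItemPointerGetOffsetNumber((ItemPointer) right);
    if (loff != roff)
        return (loff < roff) ? -1 : 1;
    return 0;
}

/* Called through a function pointer once per index tuple */
static bool
lazy_tid_reaped(ItemPointer itemptr, void *state)
{
    LVDeadTuples *dead_tuples = (LVDeadTuples *) state;

    return bsearch(itemptr, dead_tuples->itemptrs, dead_tuples->num_tuples,
                   sizeof(ItemPointerData), vac_cmp_itemptr) != NULL;
}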
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Problems and Solutions
===============
Firstly, I considered using existing data structures:
IntegerSet (src/backend/lib/integerset.c) and
TIDBitmap (src/backend/nodes/tidbitmap.c). Both address point 1, but
each addresses only one of points 2 and 3. IntegerSet uses less memory
thanks to Simple-8b encoding, but lookup is slow, still O(logN), since
it's a tree structure. On the other hand, TIDBitmap has good lookup
performance, O(1), but can use unnecessarily large amounts of memory
in some cases, since it always allocates enough bitmap space to store
all possible offsets. With 8kB blocks, the maximum number of line
pointers in a heap page is 291 (cf. MaxHeapTuplesPerPage), so the
bitmap is 40 bytes long and we always need 46 bytes in total per
block, including other meta information.
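To spell out that arithmetic (my own illustration, assuming 64-bit
bitmapwords as in tidbitmap.c):

/* Per-block cost of a TIDBitmap-style entry with 8kB blocks */
#define MAX_TUPLES  291                     /* MaxHeapTuplesPerPage */
#define WORDS       ((MAX_TUPLES + 63) / 64)    /* = 5 bitmapwords */
/* bitmap: 5 * 8 = 40 bytes; together with the block number and entry
 * status, that is the ~46 bytes per block mentioned above */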
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum, borrowing ideas from Roaring Bitmap [2]. The
authors provide an implementation of Roaring Bitmap [3] (Apache 2.0
license), but I implemented the idea from scratch because we need to
integrate it with Dynamic Shared Memory/Area to support parallel
vacuum, and we need to support ItemPointerData, a 6-byte integer in
total, whereas that implementation supports only 4-byte integers.
Also, when it comes to vacuum, we need neither intersection, union,
nor difference between sets; we need only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of a
hash table and a container area: the hash table has one entry per
block, and each block entry allocates memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. Within the container area, the representation of the offset
numbers varies depending on their cardinality. There are three
container types: array, bitmap, and run.
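To make the layout concrete, here is a rough sketch of what the data
structure looks like (illustrative only; the names and fields here are
simplified, not the actual PoC code):

typedef enum RTbmContainerType
{
    RTBM_CONTAINER_ARRAY,       /* sorted 2-byte offsets */
    RTBM_CONTAINER_BITMAP,      /* uncompressed offset bitmap */
    RTBM_CONTAINER_RUN          /* (start, length) 2-byte pairs */
} RTbmContainerType;

/* one entry per heap block that has at least one dead tuple */
typedef struct RTbmBlockEntry
{
    BlockNumber blkno;          /* hash key */
    RTbmContainerType type;
    uint32      offset;         /* byte offset into the container area */
    uint16      len;            /* number of 2-byte values, or bytes
                                 * for a bitmap container */
} RTbmBlockEntry;

typedef struct RTbm
{
    struct rtbm_hash *hashtab;  /* blkno -> RTbmBlockEntry */
    char       *containers;     /* container area, enlarged as needed */
    Size        used;
    Size        allocated;
} RTbm;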
For example, if there are two dead tuples at offsets 1 and 150, it
uses an array container holding two 2-byte integers representing 1 and
150, 4 bytes in total. If we used a bitmap container in this case, we
would need 20 bytes instead. On the other hand, if there are 20
consecutive dead tuples from offset 1 to 20, it uses a run container
holding an array of 2-byte integer pairs: the first value in each pair
is a starting offset number, and the second is the run length.
Therefore, in this case the run container also uses only 4 bytes in
total. Finally, if there are dead tuples at every other offset from 1
to 100, it uses a bitmap container holding an uncompressed bitmap,
using 13 bytes. On top of this, each block needs another 16 bytes for
its hash table entry.
The lookup complexity of a bitmap container is O(1), whereas that of
an array or a run container is O(logN) or O(N). But since the number
of elements in those two containers should not be large, this should
not be a problem.
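A lookup then dispatches on the container type, along these lines
(again an illustrative sketch; rtbm_hash_lookup() is a hypothetical
hash table helper):

static bool
rtbm_lookup(RTbm *rtbm, ItemPointer tid)
{
    BlockNumber blk = ItemPointerGetBlockNumber(tid);
    OffsetNumber off = ItemPointerGetOffsetNumber(tid);
    RTbmBlockEntry *entry = rtbm_hash_lookup(rtbm->hashtab, blk);
    char       *cont;

    if (entry == NULL)
        return false;           /* no dead tuples on this block */
    cont = rtbm->containers + entry->offset;

    switch (entry->type)
    {
        case RTBM_CONTAINER_BITMAP:     /* O(1) bit test */
            return (((uint8 *) cont)[(off - 1) / 8] &
                    (1 << ((off - 1) % 8))) != 0;

        case RTBM_CONTAINER_ARRAY:      /* O(logN) binary search */
            {
                uint16     *offs = (uint16 *) cont;
                int         lo = 0,
                            hi = entry->len - 1;

                while (lo <= hi)
                {
                    int         mid = (lo + hi) / 2;

                    if (offs[mid] == off)
                        return true;
                    else if (offs[mid] < off)
                        lo = mid + 1;
                    else
                        hi = mid - 1;
                }
                return false;
            }

        case RTBM_CONTAINER_RUN:    /* O(N) over (start, length) pairs */
            {
                uint16     *runs = (uint16 *) cont;

                for (int i = 0; i < entry->len; i += 2)
                {
                    if (off >= runs[i] && off < runs[i] + runs[i + 1])
                        return true;
                }
                return false;
            }
    }
    return false;
}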
Evaluation
========
Before implementing this idea and integrating it with the lazy vacuum
code, I implemented a benchmark tool dedicated to evaluating
lazy_tid_reaped() performance [4]. It has functions for generating
TIDs for both index tuples and dead tuples, loading the dead tuples
into the data structure, and simulating lazy_tid_reaped() using those
virtual index tuples and dead tuples. The code lacks many features
such as iteration and DSM/DSA support, but it makes testing the data
structures easier.
FYI, I've confirmed the validity of this tool. When I ran a vacuum on
a 3GB table, index vacuuming took 12.3 sec and lazy_tid_reaped() took
approximately 8.5 sec. Simulating a similar situation with the tool,
the lookup benchmark with the array data structure took approximately
8.0 sec. Given that the tool doesn't simulate the cost of function
calls, it seems to simulate reality reasonably well.
I've evaluated the lookup performance and memory footprint of four
data structures: array, integerset (intset), tidbitmap (tbm), and
roaring tidbitmap (rtbm), while changing the distribution of dead
tuples across blocks. Since tbm doesn't have a function for an
existence check, I added one, and I allocated enough memory to ensure
that tbm never becomes lossy during the evaluation. In all test cases,
I simulated a table with 1,000,000 blocks where every block has at
least one dead tuple. The benchmark scenario is that for each virtual
heap tuple we check whether its TID is in the dead tuple storage. Here
are the results, with execution time in milliseconds and memory usage
in bytes:
* Test-case 1 (10 dead tuples at intervals of 20 offsets)
An array container is selected in this test case, using 20 bytes for each block.
        Execution Time    Memory Usage
array        14,140.91      60,008,248
intset        9,350.08      50,339,840
tbm           1,299.62     100,671,544
rtbm          1,892.52      58,744,944
* Test-case 2 (10 consecutive dead tuples from offset 1)
A bitmap container is selected in this test case, using 2 bytes for each block.
        Execution Time    Memory Usage
array         1,056.60      60,008,248
intset          650.85      50,339,840
tbm             194.61     100,671,544
rtbm            154.57      27,287,664
* Test-case 3 (2 dead tuples at offsets 1 and 100)
An array container is selected in this test case, using 4 bytes for
each block. Since the 'array' data structure (not the array container
of rtbm) uses only 12 bytes per block, and rtbm additionally needs a
hash table entry per block, the 'array' data structure uses less
memory here.
        Execution Time    Memory Usage
array         6,054.22      12,008,248
intset        4,203.41      16,785,408
tbm             759.17     100,671,544
rtbm            750.08      29,384,816
* Test-case 4 (100 consecutive dead tuples from offset 1)
A run container is selected in this test case, using 4 bytes for each block.
        Execution Time    Memory Usage
array         8,883.03     600,008,248
intset        7,358.23     100,671,488
tbm             758.81     100,671,544
rtbm            764.33      29,384,816
Overall, 'rtbm' has much better lookup performance and good memory
usage, especially when there are relatively many dead tuples. However,
in some cases, 'intset' and 'array' have better memory usage.
Feedback is very welcome. Thank you for reading the email through to the end.
Regards,
[1]: /messages/by-id/CAGTBQpbDCaR6vv9=scXzuT8fSbckf=a3NgZdWFWZbdVugVht6Q@mail.gmail.com
[2]: http://roaringbitmap.org/
[3]: https://github.com/RoaringBitmap/CRoaring
[4]: https://github.com/MasahikoSawada/pgtools/tree/master/bdbench
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Wed, 7 Jul 2021 at 13:47, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [...] Overall, 'rtbm' has much better lookup performance and good
> memory usage, especially when there are relatively many dead tuples.
> However, in some cases, 'intset' and 'array' have better memory
> usage.

Those are some great results, with a good path to meaningful improvements.

> Feedback is very welcome.
The currently available infrastructure for TIDs is quite ill-defined
for TableAM authors [0], and other TableAMs might want to use more
than just the 11 bits needed by heapam's max-BLCKSZ
MaxHeapTuplesPerPage to identify tuples (MaxHeapTuplesPerPage is 1169
at the maximum 32kB BLCKSZ, which requires 11 bits).
Could you also check what the (performance, memory) impact would be if
these proposed structures were to support the maximum
MaxHeapTuplesPerPage of 1169 or the full uint16-range of offset
numbers that could be supported by our current TID struct?
Kind regards,
Matthias van de Meent
[0]: /messages/by-id/0bbeb784050503036344e1f08513f13b2083244b.camel@j-davis.com
On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Currently, the TIDs of dead tuples are stored in an array that is
> collectively allocated at the start of lazy vacuum and TID lookup uses
> bsearch(). There are the following challenges and limitations:
>
> 1. It cannot allocate more than 1GB. There was a discussion about
> eliminating this limitation by using MemoryContextAllocHuge(), but
> there were concerns about point 2 [1].
I think that the main problem with the 1GB limitation is that it is
surprising -- it can cause disruption when we first exceed the magical
limit of ~174 million TIDs. This can cause us to dirty index pages a
second time when we might have been able to just do it once with
sufficient memory for TIDs. OTOH there are actually cases where having
less memory for TIDs makes performance *better* because of locality
effects. This perverse behavior with memory sizing isn't a rare case
that we can safely ignore -- unfortunately it's fairly common.
My point is that we should be careful to choose the correct goal.
Obviously memory use matters. But it might be more helpful to think of
memory use as just a proxy for what truly matters, not a goal in
itself. It's hard to know what this means (what is the "real goal"?),
and hard to measure it even if you know for sure. It could still be
useful to think of it like this.
> A run container is selected in this test case, using 4 bytes for each block.
>
>         Execution Time    Memory Usage
> array         8,883.03     600,008,248
> intset        7,358.23     100,671,488
> tbm             758.81     100,671,544
> rtbm            764.33      29,384,816
>
> Overall, 'rtbm' has much better lookup performance and good memory
> usage, especially when there are relatively many dead tuples. However,
> in some cases, 'intset' and 'array' have better memory usage.
This seems very promising.
I wonder how much you have thought about the index AM side. It makes
sense to initially evaluate these techniques using this approach of
separating the data structure from how it is used by VACUUM -- I think
that that was a good idea. But at the same time there may be certain
important theoretical questions that cannot be answered this way --
questions about how everything "fits together" in a real VACUUM might
matter a lot. You've probably thought about this at least a little
already. Curious to hear how you think it "fits together" with the
work that you've done already.
The loop inside btvacuumpage() makes each loop iteration call the
callback -- this is always a call to lazy_tid_reaped() in practice.
And that's where we do binary searches. These binary searches are
usually where we see a huge number of cycles spent when we look at
profiles, including the profile that produced your flame graph. But I
worry that that might be a bit misleading -- the way that profilers
attribute costs is very complicated and can never be fully trusted.
While it is true that lazy_tid_reaped() often accesses main memory,
which will of course add a huge amount of latency and make it a huge
bottleneck, the "big picture" is still relevant.
I think that the compiler currently has to make very conservative
assumptions when generating the machine code used by the loop inside
btvacuumpage(), which calls through an opaque function pointer at
least once per loop iteration -- anything can alias, so the compiler
must be conservative. The data dependencies are hard for both the
compiler and the CPU to analyze. The cost of using a function pointer
compared to a direct function call is usually quite low, but there are
important exceptions -- cases where it prevents other useful
optimizations. Maybe this is an exception.
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
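As a rough sketch of the shape I have in mind (not working code -- the
batch callback vacuum_tids_reaped_batch() is invented here, and
posting list tuples are ignored):

/* Inside btvacuumpage(), roughly: */
ItemPointerData htids[MaxIndexTuplesPerPage];
bool        dead[MaxIndexTuplesPerPage];
OffsetNumber deletable[MaxIndexTuplesPerPage];
int         nhtids = 0,
            ndeletable = 0;

/* Loop 1: collect the heap TID of every tuple on the leaf page */
for (offnum = minoff; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
{
    IndexTuple  itup = (IndexTuple)
        PageGetItem(page, PageGetItemId(page, offnum));

    htids[nhtids++] = itup->t_tid;
}

/* One batch call into vacuumlazy.c for the whole page, instead of one
 * callback call per index tuple; sets dead[i] for each reaped TID */
vacuum_tids_reaped_batch(htids, nhtids, dead, vstate->callback_state);

/* Loop 2: physically delete the tuples whose heap TIDs were dead */
for (int i = 0; i < nhtids; i++)
{
    if (dead[i])
        deletable[ndeletable++] = minoff + i;
}
if (ndeletable > 0)
    _bt_delitems_vacuum(rel, buf, deletable, ndeletable, NULL, 0);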
This approach would make btbulkdelete() similar to
_bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
an idea independent of yours -- I imagine that this would work
far better when combined with a more compact data structure, which is
naturally more capable of batch processing than a simple array of
TIDs. Maybe this will help the compiler and the CPU to fully
understand the *natural* data dependencies, so that they can be as
effective as possible in making the code run fast. It's possible that
a modern CPU will be able to *hide* the latency more intelligently
than what we have today. The latency is such a big problem that we may
be able to justify "wasting" other CPU resources, just because it
sometimes helps with hiding the latency. For example, it might
actually be okay to sort all of the TIDs on the page to make the bulk
processing work -- though you might still do a precheck that is
similar to the precheck inside lazy_tid_reaped() that was added by you
in commit bbaf315309e.
Of course it's very easy to be wrong about stuff like this. But it
might not be that hard to prototype. You can literally copy and paste
code from _bt_delitems_delete_check() to do this. It does the same
basic thing already.
--
Peter Geoghegan
On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.
Maybe for something like rtbm.c (which is inspired by Roaring
bitmaps), you would really want to use an "intersection" operation for
this. The TIDs that we need to physically delete from the leaf page
inside btvacuumpage() are the intersection of two bitmaps: our bitmap
of all TIDs on the leaf page, and our bitmap of all TIDs that need to
be deleted by the ongoing btbulkdelete() call.
Obviously the typical case is that most TIDs in the index do *not* get
deleted -- needing to delete more than ~20% of all TIDs in the index
will be rare. Ideally it would be very cheap to figure out that a TID
does not need to be deleted at all. Something a little like a negative
cache (but not a true negative cache). This is a little bit like how
hash joins can be made faster by adding a Bloom filter -- most hash
probes don't need to join a tuple in the real world, and we can make
these hash probes even faster by using a Bloom filter as a negative
cache.
If you had the list of TIDs from a leaf page sorted for batch
processing, and if you had roaring bitmap style "chunks" with
"container" metadata stored in the data structure, you could then use
merging/intersection -- that has some of the same advantages. I think
that this would be a lot more efficient than having one binary search
per TID. Most TIDs from the leaf page can be skipped over very
quickly, in large groups. It's very rare for VACUUM to need to delete
TIDs from completely random heap table blocks in the real world (some
kind of pattern is much more common).
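To sketch the kind of merging I mean (illustrative only, reusing the
rtbm container layout sketched upthread; rtbm_hash_lookup() is again a
hypothetical helper):

/* Intersect a sorted array of leaf-page TIDs with the per-block
 * containers; each block group is either skipped wholesale or checked
 * against its container. */
static void
rtbm_intersect_sorted(RTbm *rtbm, ItemPointerData *tids, int ntids,
                      bool *dead)
{
    int         i = 0;

    while (i < ntids)
    {
        BlockNumber blk = ItemPointerGetBlockNumber(&tids[i]);

        if (rtbm_hash_lookup(rtbm->hashtab, blk) == NULL)
        {
            /* whole block has no dead tuples: skip the group quickly */
            while (i < ntids && ItemPointerGetBlockNumber(&tids[i]) == blk)
                dead[i++] = false;
        }
        else
        {
            /* check each offset against this block's container */
            while (i < ntids && ItemPointerGetBlockNumber(&tids[i]) == blk)
            {
                dead[i] = rtbm_lookup(rtbm, &tids[i]);
                i++;
            }
        }
    }
}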
When this merging process finds 1 TID that might really be deletable
then it's probably going to find much more than 1 -- better to make
that cache miss take care of all of the TIDs together. Also seems like
the CPU could do some clever prefetching with this approach -- it
could prefetch TIDs where the initial chunk metadata is insufficient
to eliminate them early -- these are the groups of TIDs that will have
many TIDs that we actually need to delete. ISTM that improving
temporal locality through batching could matter a lot here.
--
Peter Geoghegan
On Wed, Jul 7, 2021 at 11:25 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> > [...] Overall, 'rtbm' has much better lookup performance and good
> > memory usage, especially when there are relatively many dead tuples.
>
> The currently available infrastructure for TIDs is quite ill-defined
> for TableAM authors [0], and other TableAMs might want to use more
> than just the 11 bits needed by heapam's max-BLCKSZ
> MaxHeapTuplesPerPage to identify tuples (MaxHeapTuplesPerPage is 1169
> at the maximum 32kB BLCKSZ, which requires 11 bits).
>
> Could you also check what the (performance, memory) impact would be if
> these proposed structures were to support the maximum
> MaxHeapTuplesPerPage of 1169 or the full uint16-range of offset
> numbers that could be supported by our current TID struct?
I think tbm will be the most affected by the memory impact of the
larger maximum MaxHeapTuplesPerPage. For example, with 32kB blocks
(MaxHeapTuplesPerPage = 1169), even if there is only one dead tuple in
a block, it will always require at least 147 bytes per block.
Rtbm chooses the container type among array, bitmap, or run depending
on the number and distribution of dead tuples in a block, and only
bitmap containers can be searched with O(1). Run containers depend on
the distribution of dead tuples within a block. So let’s compare array
and bitmap containers.
With 8kB blocks (MaxHeapTuplesPerPage = 291), at most 36 bytes are
needed for a bitmap container. In other words, compared to an array
container, a bitmap will be chosen if there are more than 18 dead
tuples in a block. On the other hand, with 32kB blocks
(MaxHeapTuplesPerPage = 1169), at most 147 bytes are needed for a
bitmap container, so a bitmap container will be chosen if there are
more than 74 dead tuples in a block. And with the full uint16 range
(MaxHeapTuplesPerPage = 65535), at most 8192 bytes are needed, so a
bitmap container will be chosen if there are more than 4096 dead
tuples in a block. Therefore, in any case, if more than about 6% of
the tuples in a block are garbage, a bitmap container will be chosen,
bringing faster lookup performance. (Of course, if a run container is
chosen, the container size gets smaller, but the lookup performance is
O(logN).) But if the number of dead tuples in the table is small and
we have a larger MaxHeapTuplesPerPage, an array container is likely to
be chosen, and the lookup performance becomes O(logN). Still, it
should be faster than the array data structure because the range of
search targets in an array container is much smaller.
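In code form, the container choice boils down to something like this
(my own illustration; rounding details glossed over):

static bool
choose_bitmap_container(int ndead)
{
    int     bitmap_bytes = (MaxHeapTuplesPerPage + 7) / 8;

    /* An array container costs 2 bytes per dead offset, so the
     * crossover is at ndead ~= MaxHeapTuplesPerPage / 16, i.e. about
     * 6% of the offsets, regardless of block size. */
    return ndead * 2 > bitmap_bytes;
}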
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Jul 8, 2021 at 5:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I think that the main problem with the 1GB limitation is that it is
> surprising -- it can cause disruption when we first exceed the magical
> limit of ~174 million TIDs. [...]
>
> My point is that we should be careful to choose the correct goal.
> Obviously memory use matters. But it might be more helpful to think of
> memory use as just a proxy for what truly matters, not a goal in
> itself. It's hard to know what this means (what is the "real goal"?),
> and hard to measure it even if you know for sure. It could still be
> useful to think of it like this.
As I wrote in the first email, I think there are two important
factors in index vacuuming performance: the speed of checking whether
the heap TID an index tuple points to is dead, and the number of index
bulk-deletion passes. The flame graph I attached in the first mail
shows the CPU spending much time in lazy_tid_reaped(), but vacuum is a
disk-intensive operation in practice. Given that most index AMs'
bulk-deletion does a full index scan and a table can have multiple
indexes, reducing the number of index bulk-deletion passes really
contributes to reducing the execution time, especially for large
tables. I think a more compact data structure for dead tuple TIDs is
one way to achieve that.
> This seems very promising.
>
> I wonder how much you have thought about the index AM side. It makes
> sense to initially evaluate these techniques using this approach of
> separating the data structure from how it is used by VACUUM -- I think
> that that was a good idea. But at the same time there may be certain
> important theoretical questions that cannot be answered this way --
> questions about how everything "fits together" in a real VACUUM might
> matter a lot. You've probably thought about this at least a little
> already. Curious to hear how you think it "fits together" with the
> work that you've done already.
Yeah, that definitely needs to be considered. Currently, what we need
from the dead tuple storage for lazy vacuum is store, lookup, and
iteration. And given parallel vacuum, it has to be allocatable in DSM
or DSA. While implementing the PoC code, I've been trying to integrate
it with the current lazy vacuum code. As far as I've seen so far, the
integration is not hard, at least with the *current* lazy vacuum code
and index AM code.
> [...]
>
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.
>
> [...] For example, it might
> actually be okay to sort all of the TIDs on the page to make the bulk
> processing work -- though you might still do a precheck that is
> similar to the precheck inside lazy_tid_reaped() that was added by you
> in commit bbaf315309e.
Interesting idea. I remember you mentioned this idea somewhere, and I
considered it too while implementing the PoC code. It's definitely
worth trying. Maybe we can work on it as a separate patch? It would
change the index AM interface and could also improve the current
bulk-deletion. We can consider a better data structure on top of this
idea.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Very nice results.
I have been working on the same problem but with a bit different
solution - a mix of binary search for (sub)pages and 32-bit bitmaps
for tid-in-page.
Even with the current allocation heuristics (allocate 291 tids per
page), it initially allocates much less space: instead of the current
291*6 = 1746 bytes per page, it needs to allocate only 80 bytes.
Also it can be laid out so that it is friendly to parallel SIMD
searches doing up to 8 tid lookups in parallel.
That said, for allocating the tid array, the best solution is to
postpone it as much as possible and to do the initial collection into
a file, which
1) postpones the memory allocation to the beginning of index cleanups
2) lets you select the correct size and structure as you know more
about the distribution at that time
3) lets you do the first heap pass in one go and then advance
frozenxmin *before* index cleanup
Also, collecting dead tids into a file makes it trivial (well, almost
:) ) to parallelize the initial heap scan, so more resources can be
thrown at it if available.
Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
Resending as I forgot to send it to the list (thanks Peter :) )
On Wed, Jul 7, 2021 at 10:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The loop inside btvacuumpage() makes each loop iteration call the
> callback -- this is always a call to lazy_tid_reaped() in practice.
> And that's where we do binary searches. These binary searches are
> usually where we see a huge number of cycles spent when we look at
> profiles, including the profile that produced your flame graph. But I
> worry that that might be a bit misleading -- the way that profilers
> attribute costs is very complicated and can never be fully trusted.
> While it is true that lazy_tid_reaped() often accesses main memory,
> which will of course add a huge amount of latency and make it a huge
> bottleneck, the "big picture" is still relevant.
This is why I have mainly focused on making it possible to use SIMD
and run 4-8 binary searches in parallel, mostly 8, for AVX2.
How I am approaching this is separating the "page search" to run over
a (naturally) sorted array of 32-bit page pointers, and only when the
page is found are the indexes in this array used to look up the
in-page bitmaps.
This allows the heavier bsearch activity to run over a smaller range
of memory, hopefully reducing the cache thrashing.
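To illustrate my reading of that layout (names invented here, and
simplified to one flat bitmap per page):

typedef struct DeadTupleIndex
{
    uint32     *blknos;         /* sorted, densely packed block numbers */
    uint32     *bitmaps;        /* 10 x 32-bit words per block, parallel
                                 * to blknos (covers offsets 1..291) */
    int         nblocks;
} DeadTupleIndex;

static bool
dti_lookup(DeadTupleIndex *dti, BlockNumber blkno, OffsetNumber off)
{
    int         lo = 0,
                hi = dti->nblocks - 1;

    /* the bsearch touches only the packed 4-byte keys, which is
     * cache-friendlier than bsearch over 6-byte ItemPointerData */
    while (lo <= hi)
    {
        int         mid = (lo + hi) / 2;

        if (dti->blknos[mid] == blkno)
        {
            const uint32 *bm = dti->bitmaps + mid * 10;

            return (bm[(off - 1) / 32] &
                    ((uint32) 1 << ((off - 1) % 32))) != 0;
        }
        if (dti->blknos[mid] < blkno)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}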
There are opportunities to optimise this further for cache hits, by
collecting the tids from indexes in larger batches and then
constraining the searches in the main is-deleted bitmap to run over
sections of it, but at some point this becomes a very complex
balancing act, as the manipulation of the bits-to-check from indexes
also takes time, not to mention the need to release the index pages
and then later chase the tid pointers in case they have moved while
checking them.
I have not measured anything yet, but one of my concerns is that
searching very large dead tuple collections with an 8-way parallel
bsearch could actually get close to saturating RAM bandwidth by
reading (8 x 32 bits x cache-line-size) bytes from main memory every
few cycles, so we may need some inner-loop-level throttling similar to
the current vacuum_cost_limit for data pages.
> I think that the compiler currently has to make very conservative
> assumptions when generating the machine code used by the loop inside
> btvacuumpage(), which calls through an opaque function pointer at
> least once per loop iteration -- anything can alias, so the compiler
> must be conservative.
Definitely this! The lookup function needs to be turned into an
inline function or #define as well, to give the compiler maximum
freedom.
> The data dependencies are hard for both the
> compiler and the CPU to analyze. The cost of using a function pointer
> compared to a direct function call is usually quite low, but there are
> important exceptions -- cases where it prevents other useful
> optimizations. Maybe this is an exception.
Yes. Also this could be a place where unrolling the loop could make a
real difference.
Maybe not unrolling the full 32 iterations of a 32-bit bsearch, but
something like an 8-iteration unroll to get most of the benefit.
The full 32x unroll would not be that bad for performance if all 32
iterations were needed, but mostly we would need to jump into the last
10 to 20 iterations (for lookups over roughly 1,000 to 1,000,000
pages), and I suspect this is such a weird corner case that the
compiler is really unlikely to support this optimisation. Of course I
may be wrong and it is a common enough case for the optimiser.
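For reference, the usual trick for making such a search unroll-friendly
is a branchless form whose iteration count depends only on the array
size, something like (a generic sketch, not measured):

/* Branchless binary search over n sorted uint32 keys; the loop trip
 * count depends only on n, so the compiler can unroll it. */
static int
branchless_search(const uint32 *keys, int n, uint32 target)
{
    const uint32 *base = keys;

    if (n <= 0)
        return -1;
    while (n > 1)
    {
        int         half = n / 2;

        /* compiles to a conditional move rather than a branch */
        base = (base[half] <= target) ? base + half : base;
        n -= half;
    }
    return (*base == target) ? (int) (base - keys) : -1;
}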
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.
While it may make sense to have different bitmap encodings for
different distributions, it likely would not be good for optimisations
if all of these are used at the same time.
This is why I propose that the first bitmap collecting phase collect
into a file, and then - when reading it into memory for the lookup
phase - possibly rewrite the initial structure into something else if
that would be more efficient, for example when the first half of the
file consists of only empty pages.
> This approach would make btbulkdelete() similar to
> _bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
> an idea independent of yours -- I imagine that this would work
> far better when combined with a more compact data structure, which is
> naturally more capable of batch processing than a simple array of
> TIDs. Maybe this will help the compiler and the CPU to fully
> understand the *natural* data dependencies, so that they can be as
> effective as possible in making the code run fast. It's possible that
> a modern CPU will be able to *hide* the latency more intelligently
> than what we have today. The latency is such a big problem that we may
> be able to justify "wasting" other CPU resources, just because it
> sometimes helps with hiding the latency. For example, it might
> actually be okay to sort all of the TIDs on the page to make the bulk
> processing work
Then again, it may be so much extra work that it starts to dominate
some parts of profiles.
For example, see the work that was done in improving the mini-vacuum
part, where it was actually faster to copy data out to a separate
buffer and then back in than to shuffle it around inside the same 8kB
page :)
So only testing will tell.
> -- though you might still do a precheck that is
> similar to the precheck inside lazy_tid_reaped() that was added by you
> in commit bbaf315309e.
>
> Of course it's very easy to be wrong about stuff like this. But it
> might not be that hard to prototype. You can literally copy and paste
> code from _bt_delitems_delete_check() to do this. It does the same
> basic thing already.
Also, a lot of testing would be needed to figure out which strategy
fits best for which distribution of dead tuples, and possibly their
relation to the order of the tuples to check from indexes.
Cheers
--
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
On Thu, Jul 8, 2021 at 1:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> As I wrote in the first email, I think there are two important
> factors in index vacuuming performance: the speed of checking whether
> the heap TID an index tuple points to is dead, and the number of index
> bulk-deletion passes. The flame graph I attached in the first mail
> shows the CPU spending much time in lazy_tid_reaped(), but vacuum is a
> disk-intensive operation in practice.
Maybe. But I recently bought an NVME SSD that can read at over
6GB/second. So "disk-intensive" is not what it used to be -- at least
not for reads. In general it's not good if we do multiple scans of an
index -- no question. But there is a danger in paying a little too
much attention to what is true in general -- we should not ignore what
might be true in specific cases either. Maybe we can solve some
problems by spilling the TID data structure to disk -- if we trade
sequential I/O for random I/O, we may be able to do only one pass over
the index (especially when we have *almost* enough memory to fit all
TIDs, but not quite enough).
The big problem with multiple passes over the index is not the extra
read bandwidth -- it's the extra page dirtying (writes), especially
with things like indexes on UUID columns. We want to dirty each leaf
page in each index at most once per VACUUM, and should be willing to
pay some cost in order to get a larger benefit with page dirtying.
After all, writes are much more expensive on modern flash devices --
if we have to do more random read I/O to spill the TIDs then that
might actually be 100% worth it. And, we don't need much memory for
something that works well as a negative cache, either -- so maybe the
extra random read I/O needed to spill the TIDs will be very limited
anyway.
There are many possibilities. You can probably think of other
trade-offs yourself. We could maybe use a cost model for all this --
it is a little like a hash join IMV. This is just something to think
about while refining the design.
> Interesting idea. I remember you mentioned this idea somewhere, and I
> considered it too while implementing the PoC code. It's definitely
> worth trying. Maybe we can work on it as a separate patch? It would
> change the index AM interface and could also improve the current
> bulk-deletion. We can consider a better data structure on top of this
> idea.
I'm happy to write it as a separate patch, either by leaving it to you
or by collaborating directly. It's not necessary to tie it to the
first patch. But at the same time it is highly related to what you're
already doing.
As I said I am totally prepared to be wrong here. But it seems worth
it to try. In Postgres 14, the _bt_delitems_vacuum() function (which
actually carries out VACUUM's physical page modifications to a leaf
page) is almost identical to _bt_delitems_delete(). And
_bt_delitems_delete() was already built with these kinds of problems
in mind -- it batches work to get the most out of synchronizing with
distant state describing which tuples to delete. It's not exactly the
same situation, but it's *kinda* similar. More importantly, it's a
relatively cheap and easy experiment to run, since we already have
most of what we need (we can take it from
_bt_delitems_delete_check()).
Usually this kind of micro optimization is not very valuable -- 99.9%+
of all code just isn't that sensitive to having the right
optimizations. But this is one of the rare important cases where we
really should look at the raw machine code, and do some kind of
microarchitectural level analysis through careful profiling, using
tools like perf. The laws of physics (or electronic engineering) make
it inevitable that searching for TIDs to match is going to be kind of
slow. But we should at least make sure that we use every trick
available to us to reduce the bottleneck, since it really does matter
a lot to users. Users should be able to expect that this code will at
least be as fast as the hardware that they paid for can allow (or
close to it). There is a great deal of microarchitectural
sophistication with modern CPUs, much of which is designed to make
problems like this one less bad [1].
[1]: https://www.agner.org/optimize/microarchitecture.pdf
--
Peter Geoghegan
On Thu, Jul 8, 2021 at 1:53 PM Hannu Krosing <hannuk@google.com> wrote:
> How I am approaching this is separating the "page search" to run over
> a (naturally) sorted array of 32-bit page pointers, and only when the
> page is found are the indexes in this array used to look up the
> in-page bitmaps.
>
> This allows the heavier bsearch activity to run over a smaller range
> of memory, hopefully reducing the cache thrashing.
I think that the really important thing is to figure out roughly the
right data structure first.
> There are opportunities to optimise this further for cache hits, by
> collecting the tids from indexes in larger batches and then
> constraining the searches in the main is-deleted bitmap to run over
> sections of it, but at some point this becomes a very complex
> balancing act, as the manipulation of the bits-to-check from indexes
> also takes time, not to mention the need to release the index pages
> and then later chase the tid pointers in case they have moved while
> checking them.
I would say that 200 TIDs per leaf page is common and ~1350 TIDs per
leaf page is not uncommon (with deduplication). Seems like that might
be enough?
> I have not measured anything yet, but one of my concerns is that
> searching very large dead tuple collections with an 8-way parallel
> bsearch could actually get close to saturating RAM bandwidth by
> reading (8 x 32 bits x cache-line-size) bytes from main memory every
> few cycles, so we may need some inner-loop-level throttling similar to
> the current vacuum_cost_limit for data pages.
If it happens then it'll be a nice problem to have, I suppose.
> Maybe not unrolling the full 32 iterations of a 32-bit bsearch, but
> something like an 8-iteration unroll to get most of the benefit.
My current assumption is that we're bound by memory speed right now,
and that that is the big bottleneck to eliminate -- we must keep the
CPU busy with data to process first. That seems like the most
promising thing to focus on right now.
> While it may make sense to have different bitmap encodings for
> different distributions, it likely would not be good for optimisations
> if all these are used at the same time.
To some degree designs like Roaring bitmaps are just that -- a way of
dynamically figuring out which strategy to use based on data
characteristics.
> This is why I propose the first bitmap collecting phase to collect
> into a file and then - when reading into memory for the lookup phase -
> possibly rewrite the initial structure to something else if it sees
> that it is more efficient. Like for example where the first half of
> the file consists of only empty pages.
Yeah, I agree that something like that could make sense. Although
rewriting it doesn't seem particularly promising, since we can easily
make it cheap to process any TID that falls into a range of blocks
that have no dead tuples. We don't need to rewrite the data structure
to make it do that well, AFAICT.
When I said that I thought of this a little like a hash join, I was
being more serious than you might imagine. Note that the number of
index tuples that VACUUM will delete from each index can now be far
less than the total number of TIDs stored in memory. So even when we
have (say) 20% of all of the TIDs from the table in our in memory list
managed by vacuumlazy.c, it's now quite possible that VACUUM will only
actually "match"/"join" (i.e. delete) as few as 2% of the index tuples
it finds in the index (there really is no way to predict how many).
The opportunistic deletion stuff could easily be doing most of the
required cleanup in an eager fashion following recent improvements --
VACUUM need only take care of "floating garbage" these days. In other
words, thinking about this as something that is a little bit like a
hash join makes sense because hash joins do very well with high join
selectivity, and high join selectivity is common in the real world.
The intersection of TIDs from each leaf page with the in-memory TID
delete structure will often be very small indeed.
> Then again it may be so much extra work that it starts to dominate
> some parts of profiles.
>
> For example see the work that was done in improving the mini-vacuum
> part where it was actually faster to copy data out to a separate
> buffer and then back in than shuffle it around inside the same 8k page
Some of what I'm saying is based on the experience of improving
similar code used by index tuple deletion in Postgres 14. That did
quite a lot of sorting of TIDs and things like that. In the end the
sorting had no more than a negligible impact on performance. What
really mattered was that we efficiently coordinate with distant heap
pages that describe which index tuples we can delete from a given leaf
page. Sorting hundreds of TIDs is cheap. Reading hundreds of random
locations in memory (or even far fewer) is not so cheap. It might even
be very slow indeed. Sorting in order to batch could end up looking
like cheap insurance that we should be glad to pay for.
> So only testing will tell.
True.
--
Peter Geoghegan
On Fri, Jul 9, 2021 at 12:34 AM Peter Geoghegan <pg@bowt.ie> wrote:
> [...]
>
> I would say that 200 TIDs per leaf page is common and ~1350 TIDs per
> leaf page is not uncommon (with deduplication). Seems like that might
> be enough?
Likely yes, and it would also have the nice property of not changing
the index page locking behaviour.
Are deduplicated tids in a leaf page already sorted in heap order?
That could potentially simplify / speed up the sort.
> > I have not measured anything yet, but one of my concerns is that
> > searching very large dead tuple collections with an 8-way parallel
> > bsearch could actually get close to saturating RAM bandwidth [...]
>
> If it happens then it'll be a nice problem to have, I suppose.
>
> > Maybe not unrolling the full 32 iterations of a 32-bit bsearch, but
> > something like an 8-iteration unroll to get most of the benefit.
>
> My current assumption is that we're bound by memory speed right now,
Most likely yes, and this should also be easy to check by manually
unrolling perhaps 4 iterations and measuring any speed increase.
> and that that is the big bottleneck to eliminate -- we must keep the
> CPU busy with data to process first. That seems like the most
> promising thing to focus on right now.
This actually has two parts:
- trying to make sure that we can serve as much as possible from cache
- if we need to go out of cache, then trying to parallelise this as
much as possible
At the same time we need to watch that we are not making the index
tuple preparation work so heavy that it starts to dominate over the
memory accesses.
> > While it may make sense to have different bitmap encodings for
> > different distributions, it likely would not be good for optimisations
> > if all these are used at the same time.
>
> To some degree designs like Roaring bitmaps are just that -- a way of
> dynamically figuring out which strategy to use based on data
> characteristics.
It is, but as I am keeping one eye open for vectorisation, I don't
like it when different parts of the same bitmap have radically
different encoding strategies.
> > This is why I propose the first bitmap collecting phase to collect
> > into a file and then - when reading into memory for the lookup phase -
> > possibly rewrite the initial structure to something else if it sees
> > that it is more efficient. Like for example where the first half of
> > the file consists of only empty pages.
>
> Yeah, I agree that something like that could make sense. Although
> rewriting it doesn't seem particularly promising,
yeah, I hope to prove (or verify :) ) the structure is good enough so
that it does not need the rewrite.
> since we can easily
> make it cheap to process any TID that falls into a range of blocks
> that have no dead tuples.
I actually meant the opposite case, where we could replace a full
80-byte, 291-bit "all dead" bitmap with just a range - an int4 for the
page and two int2-s for the min and max tid-in-page - for an extra 10x
reduction, on top of the original 21x reduction from the current
6-bytes-per-tid encoding to my page_bsearch_vector bitmaps, which
encode one page in a maximum of 80 bytes (5 x int4 sub-page pointers +
5 x int4 bitmaps).
I also started out by investigating RoaringBitmaps, but when I
realized that we would likely have to rewrite it anyway, I continued
working on getting to a single uniform encoding which fits most use
cases Good Enough, and then used that uniformity to let the compiler
do its optimisation and hopefully also vectorisation magic.
> We don't need to rewrite the data structure
> to make it do that well, AFAICT.
>
> When I said that I thought of this a little like a hash join, I was
> being more serious than you might imagine. Note that the number of
> index tuples that VACUUM will delete from each index can now be far
> less than the total number of TIDs stored in memory. So even when we
> have (say) 20% of all of the TIDs from the table in our in memory list
> managed by vacuumlazy.c, it's now quite possible that VACUUM will only
> actually "match"/"join" (i.e. delete) as few as 2% of the index tuples
> it finds in the index (there really is no way to predict how many).
> The opportunistic deletion stuff could easily be doing most of the
> required cleanup in an eager fashion following recent improvements --
> VACUUM need only take care of "floating garbage" these days.
Ok, this points to the need to mainly optimise for a quite sparse
population of dead tuples, which is still mainly clustered page-wise?
> In other
> words, thinking about this as something that is a little bit like a
> hash join makes sense because hash joins do very well with high join
> selectivity, and high join selectivity is common in the real world.
> The intersection of TIDs from each leaf page with the in-memory TID
> delete structure will often be very small indeed.
The hard-to-optimize case is still when we have dead tuple counts in the
hundreds of millions, or even billions, like on an HTAP database after
a few hours of OLAP queries have accumulated loads of dead tuples in
tables getting heavy OLTP traffic.
There of course we could do a totally different optimisation, where we
also allow reaping tuples newer than the OLAP query's snapshot if we
can prove that when the snapshot moves forward next time, it has to
jump over said transactions making them indeed DEAD and not RECENTLY
DEAD. Currently we let a single OLAP query ruin everything :)
Then again it may be so much extra work that it starts to dominate
some parts of profiles.
For example see the work that was done in improving the mini-vacuum
part, where it was actually faster to copy data out to a separate
buffer and then back in than to shuffle it around inside the same 8kB
page.
Some of what I'm saying is based on the experience of improving
similar code used by index tuple deletion in Postgres 14. That did
quite a lot of sorting of TIDs and things like that. In the end the
sorting had no more than a negligible impact on performance.
Good to know :)
What
really mattered was that we efficiently coordinate with distant heap
pages that describe which index tuples we can delete from a given leaf
page. Sorting hundreds of TIDs is cheap. Reading hundreds of random
locations in memory (or even far fewer) is not so cheap. It might even
be very slow indeed. Sorting in order to batch could end up looking
like cheap insurance that we should be glad to pay for.
If the most expensive operation is sorting a few hundred tids, then
this should be fast enough.
My worries were more that after the sorting we cannot do simple
index lookups for them; each needs to be found via bsearch, or maybe
even a plain linear search if that is faster under some size limit, and
these could add up. Or some other needed thing that also has to be
done, like allocating extra memory or moving other data around in a
way that the CPU does not like.
Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
The big problem with bsearch isn't imo the O(log(n)) complexity - it's
that it has abominably bad cache locality. And that can be addressed:
https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
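For instance, a minimal sketch of what I mean; the function names here
are made up, and encode_tid() just mirrors what itemptr_encode() does:

#include "postgres.h"
#include "storage/itemptr.h"

/* Pack a TID into one int64, like itemptr_encode(): block number in the
 * upper bits, offset in the low 16 bits.  Comparisons then become plain
 * integer compares instead of vac_cmp_itemptr() calls. */
static inline int64
encode_tid(ItemPointer tid)
{
	return ((int64) ItemPointerGetBlockNumber(tid) << 16) |
		ItemPointerGetOffsetNumber(tid);
}

/* Open-coded binary search over the sorted encoded array.  The inlined
 * comparator and branch-light loop avoid most of the misprediction cost
 * of a generic bsearch(). */
static bool
encoded_tid_exists(const int64 *keys, size_t n, int64 key)
{
	const int64 *base = keys;

	while (n > 1)
	{
		size_t		half = n / 2;

		/* tends to compile to a conditional move, not a branch */
		base = (base[half] <= key) ? base + half : base;
		n -= half;
	}
	return n == 1 && *base == key;
}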
Imo 2) isn't really that hard a problem to improve, even if we were to
stay with the current bsearch approach. Reallocation with an aggressive
growth factor or such isn't that bad.
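Something along these lines, say (a sketch only; DeadTupleArray is a
hypothetical stand-in for the dead-tuple array in vacuumlazy.c):

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct DeadTupleArray
{
	ItemPointerData *itemptrs;	/* appended to during the heap scan */
	int64		num_tuples;
	int64		max_tuples;
} DeadTupleArray;

/* Grow geometrically instead of allocating the maintenance_work_mem
 * bound up front; repalloc_huge() lifts the 1GB allocation limit. */
static void
dead_tuples_append(DeadTupleArray *dt, ItemPointer tid)
{
	if (dt->num_tuples >= dt->max_tuples)
	{
		dt->max_tuples *= 2;	/* aggressive growth factor */
		dt->itemptrs = (ItemPointerData *)
			repalloc_huge(dt->itemptrs,
						  dt->max_tuples * sizeof(ItemPointerData));
	}
	dt->itemptrs[dt->num_tuples++] = *tid;
}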
That's not to say we ought to stay with binary search...
Problems Solutions
===============
Firstly, I've considered using existing data structures:
IntegerSet(src/backend/lib/integerset.c) and
TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but
only either point 2 or 3. IntegerSet uses lower memory thanks to
simple-8b encoding but is slow at lookup, still O(logN), since it’s a
tree structure. On the other hand, TIDBitmap has a good lookup
performance, O(1), but could unnecessarily use larger memory in some
cases since it always allocates the space for bitmap enough to store
all possible offsets. With 8kB blocks, the maximum number of line
pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the
bitmap is 40 bytes long and we always need 46 bytes in total per block
including other meta information.
Imo tidbitmap isn't particularly good, even in the current use cases -
it's constraining in what we can store (a problem for other AMs), not
actually that dense, the lossy mode doesn't choose what information to
lose well, etc.
It'd be nice if we came up with a datastructure that could also replace
the bitmap scan cases.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
Not a huge fan of encoding this much knowledge about the tid layout...
For example, if there are two dead tuples at offset 1 and 150, it uses
the array container that has an array of two 2-byte integers
representing 1 and 150, using 4 bytes in total. If we used the bitmap
container in this case, we would need 20 bytes instead. On the other
hand, if there are consecutive 20 dead tuples from offset 1 to 20, it
uses the run container that has an array of 2-byte integers. The first
value in each pair represents a starting offset number, whereas the
second value represents its length. Therefore, in this case, the run
container uses only 4 bytes in total. Finally, if there are dead
tuples at every other offset from 1 to 100, it uses the bitmap
container that has an uncompressed bitmap, using 13 bytes. We need
another 16 bytes per block entry for the hash table entry.
The lookup complexity of a bitmap container is O(1) whereas that of
an array or a run container is O(N) or O(logN), but since the number of
elements in those two containers should not be large, it would not be a
problem.
Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which
many tuples have been deleted. In cases where the "run" storage is
cheaper (e.g. because there's high offset numbers due to HOT pruning),
we could end up regularly scanning a few hundred entries for a
match. That's not cheap anymore.
Evaluation
========
Before implementing this idea and integrating it with lazy vacuum
code, I've implemented a benchmark tool dedicated to evaluating
lazy_tid_reaped() performance[4].
Good idea!
In all test cases, I simulated that the table has 1,000,000 blocks and
every block has at least one dead tuple.
That doesn't strike me as a particularly common scenario? I think it's
quite rare for dead tuples to be distributed so evenly yet sparsely. In
particular it's very common for there to be long runs of dead tuples
separated by long ranges with no dead tuples at all...
The benchmark scenario is that for
each virtual heap tuple we check whether its TID exists in the dead
tuple storage. Here are the results of execution time in milliseconds
and memory usage in bytes:
In which order are the dead tuples checked? Looks like in sequential
order? In the case of an index over a column that's not correlated with
the heap order the lookups are often much more random - which can
influence lookup performance drastically, due to differences in
cache locality. Which will make some structures look worse/better than
others.
Greetings,
Andres Freund
Hi,
On 2021-07-08 20:53:32 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
Just using itemptr_encode() makes array in test #1 go from 8s to 6.5s on my
machine.
Another thing I just noticed is that you didn't include the build times for the
datastructures. They are lower than the lookups currently, but it does seem
like a relevant thing to measure as well. E.g. for #1 I see the following build
times
array 24.943 ms
tbm 206.456 ms
intset 93.575 ms
vtbm 134.315 ms
rtbm 145.964 ms
that's a significant range...
Randomizing the lookup order (using a random shuffle in
generate_index_tuples()) changes the benchmark results for #1 significantly:
shuffled time unshuffled time
array 6551.726 ms 6478.554 ms
intset 67590.879 ms 10815.810 ms
rtbm 17992.487 ms 2518.492 ms
tbm 364.917 ms 360.128 ms
vtbm 12227.884 ms 1288.123 ms
FWIW, I get an assertion failure when using an assertion build:
#2 0x0000561800ea02e0 in ExceptionalCondition (conditionName=0x7f9115a88e91 "found", errorType=0x7f9115a88d11 "FailedAssertion",
fileName=0x7f9115a88e8a "rtbm.c", lineNumber=242) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
#3 0x00007f9115a87645 in rtbm_add_tuples (rtbm=0x561806293280, blkno=0, offnums=0x7fffdccabb00, nitems=10) at rtbm.c:242
#4 0x00007f9115a8363d in load_rtbm (rtbm=0x561806293280, itemptrs=0x7f908a203050, nitems=10000000) at bdbench.c:618
#5 0x00007f9115a834b9 in rtbm_attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:587
#6 0x00007f9115a83837 in attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:658
#7 0x00007f9115a84190 in attach_dead_tuples (fcinfo=0x56180322d690) at bdbench.c:873
I assume you just inverted the Assert(found) assertion?
Greetings,
Andres Freund
On Fri, Jul 9, 2021 at 12:53 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
The big problem with bsearch isn't imo the O(log(n)) complexity - it's
that it has abominably bad cache locality. And that can be addressed:
https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
Imo 2) isn't really that hard a problem to improve, even if we were to
stay with the current bsearch approach. Reallocation with an aggressive
growth factor or such isn't that bad.
That's not to say we ought to stay with binary search...
Problems Solutions
===============
Firstly, I've considered using existing data structures:
IntegerSet(src/backend/lib/integerset.c) and
TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but
only either point 2 or 3. IntegerSet uses lower memory thanks to
simple-8b encoding but is slow at lookup, still O(logN), since it’s a
tree structure. On the other hand, TIDBitmap has a good lookup
performance, O(1), but could unnecessarily use larger memory in some
cases since it always allocates the space for bitmap enough to store
all possible offsets. With 8kB blocks, the maximum number of line
pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the
bitmap is 40 bytes long and we always need 46 bytes in total per block
including other meta information.
Imo tidbitmap isn't particularly good, even in the current use cases -
it's constraining in what we can store (a problem for other AMs), not
actually that dense, the lossy mode doesn't choose what information to
lose well, etc.
It'd be nice if we came up with a datastructure that could also replace
the bitmap scan cases.
Agreed.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
Not a huge fan of encoding this much knowledge about the tid layout...
For example, if there are two dead tuples at offset 1 and 150, it uses
the array container that has an array of two 2-byte integers
representing 1 and 150, using 4 bytes in total. If we used the bitmap
container in this case, we would need 20 bytes instead. On the other
hand, if there are consecutive 20 dead tuples from offset 1 to 20, it
uses the run container that has an array of 2-byte integers. The first
value in each pair represents a starting offset number, whereas the
second value represents its length. Therefore, in this case, the run
container uses only 4 bytes in total. Finally, if there are dead
tuples at every other offset from 1 to 100, it uses the bitmap
container that has an uncompressed bitmap, using 13 bytes. We need
another 16 bytes per block entry for the hash table entry.
The lookup complexity of a bitmap container is O(1) whereas that of
an array or a run container is O(N) or O(logN), but since the number of
elements in those two containers should not be large, it would not be a
problem.
Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which
many tuples have been deleted. In cases where the "run" storage is
cheaper (e.g. because there's high offset numbers due to HOT pruning),
we could end up regularly scanning a few hundred entries for a
match. That's not cheap anymore.
With 8kB blocks, the maximum size of a bitmap container is 37 bytes.
IOW, other two types of containers are always smaller than 37 bytes.
Since the run container uses 4 bytes per run, the number of runs in a
run container can never be more than 9. Even with 32kB blocks, we don't
have more than 37 runs. So I think N is small enough in this case.
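For reference, a lookup in the run container is just a short linear scan
over sorted (start, length) pairs; a rough sketch (names are mine, not
from the patch):

#include "postgres.h"
#include "storage/off.h"

/* runs[] holds nruns pairs: runs[2*i] is a starting offset number and
 * runs[2*i + 1] the run length.  With at most 9 runs per container on
 * 8kB blocks, the scan touches only a handful of entries at worst. */
static bool
run_container_contains(const OffsetNumber *runs, int nruns,
					   OffsetNumber off)
{
	for (int i = 0; i < nruns; i++)
	{
		OffsetNumber start = runs[2 * i];
		OffsetNumber len = runs[2 * i + 1];

		if (off < start)
			return false;		/* runs are sorted; no later run matches */
		if (off < start + len)
			return true;
	}
	return false;
}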
Evaluation
========
Before implementing this idea and integrating it with lazy vacuum
code, I've implemented a benchmark tool dedicated to evaluating
lazy_tid_reaped() performance[4].
Good idea!
In all test cases, I simulated that the table has 1,000,000 blocks and
every block has at least one dead tuple.
That doesn't strike me as a particularly common scenario? I think it's
quite rare for dead tuples to be distributed so evenly yet sparsely. In
particular it's very common for there to be long runs of dead tuples
separated by long ranges with no dead tuples at all...
Agreed. I'll test with such scenarios.
The benchmark scenario is that for
each virtual heap tuple we check whether its TID exists in the dead
tuple storage. Here are the results of execution time in milliseconds
and memory usage in bytes:
In which order are the dead tuples checked? Looks like in sequential
order? In the case of an index over a column that's not correlated with
the heap order the lookups are often much more random - which can
influence lookup performance drastically, due to differences in
cache locality. Which will make some structures look worse/better than
others.
Good point. It's sequential order, which is not good. I'll test again
after shuffling virtual index tuples.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Fri, Jul 9, 2021 at 2:37 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-08 20:53:32 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
Just using itemptr_encode() makes array in test #1 go from 8s to 6.5s on my
machine.
Another thing I just noticed is that you didn't include the build times for the
datastructures. They are lower than the lookups currently, but it does seem
like a relevant thing to measure as well. E.g. for #1 I see the following build
times
array 24.943 ms
tbm 206.456 ms
intset 93.575 ms
vtbm 134.315 ms
rtbm 145.964 ms
that's a significant range...
Good point. I got similar results when measuring on my machine:
array 57.987 ms
tbm 297.720 ms
intset 113.796 ms
vtbm 165.268 ms
rtbm 199.658 ms
Randomizing the lookup order (using a random shuffle in
generate_index_tuples()) changes the benchmark results for #1 significantly:
shuffled time unshuffled time
array 6551.726 ms 6478.554 ms
intset 67590.879 ms 10815.810 ms
rtbm 17992.487 ms 2518.492 ms
tbm 364.917 ms 360.128 ms
vtbm 12227.884 ms 1288.123 ms
I believe that in your test, tbm_reaped() actually always returned
true. That could explain why tbm was very fast in both cases. Since
TIDBitmap in core doesn't support an existence check, tbm_reaped()
in bdbench.c always returns true. I added a patch to the repository to
add existence check support to TIDBitmap, although it assumes the bitmap
is never lossy.
That being said, I'm surprised that rtbm was slower than array in the
shuffled case. I've also measured the shuffled case and got
different results. To be clear, I used the prepare() SQL function to
prepare both virtual dead tuples and index tuples, loaded them with the
attach_dead_tuples() SQL function, and executed the bench() SQL function
for each data structure. Here are the results:
shuffled time unshuffled time
array 88899.513 ms 12616.521 ms
intset 73476.055 ms 10063.405 ms
rtbm 22264.671 ms 2073.171 ms
tbm 10285.092 ms 1417.312 ms
vtbm 14488.581 ms 1240.666 ms
FWIW, I get an assertion failure when using an assertion build:
#2 0x0000561800ea02e0 in ExceptionalCondition (conditionName=0x7f9115a88e91 "found", errorType=0x7f9115a88d11 "FailedAssertion",
fileName=0x7f9115a88e8a "rtbm.c", lineNumber=242) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
#3 0x00007f9115a87645 in rtbm_add_tuples (rtbm=0x561806293280, blkno=0, offnums=0x7fffdccabb00, nitems=10) at rtbm.c:242
#4 0x00007f9115a8363d in load_rtbm (rtbm=0x561806293280, itemptrs=0x7f908a203050, nitems=10000000) at bdbench.c:618
#5 0x00007f9115a834b9 in rtbm_attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:587
#6 0x00007f9115a83837 in attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:658
#7 0x00007f9115a84190 in attach_dead_tuples (fcinfo=0x56180322d690) at bdbench.c:873I assume you just inverted the Assert(found) assertion?
Right. Fixed it.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Jul 8, 2021 at 7:51 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
Maybe for something like rtbm.c (which is inspired by Roaring
bitmaps), you would really want to use an "intersection" operation for
this. The TIDs that we need to physically delete from the leaf page
inside btvacuumpage() are the intersection of two bitmaps: our bitmap
of all TIDs on the leaf page, and our bitmap of all TIDs that need to
be deleted by the ongoing btbulkdelete() call.
Agreed. In such a batch operation, what we need to do here is to
compute the intersection of two bitmaps.
Obviously the typical case is that most TIDs in the index do *not* get
deleted -- needing to delete more than ~20% of all TIDs in the index
will be rare. Ideally it would be very cheap to figure out that a TID
does not need to be deleted at all. Something a little like a negative
cache (but not a true negative cache). This is a little bit like how
hash joins can be made faster by adding a Bloom filter -- most hash
probes don't need to join a tuple in the real world, and we can make
these hash probes even faster by using a Bloom filter as a negative
cache.
Agreed.
If you had the list of TIDs from a leaf page sorted for batch
processing, and if you had roaring bitmap style "chunks" with
"container" metadata stored in the data structure, you could then use
merging/intersection -- that has some of the same advantages. I think
that this would be a lot more efficient than having one binary search
per TID. Most TIDs from the leaf page can be skipped over very
quickly, in large groups. It's very rare for VACUUM to need to delete
TIDs from completely random heap table blocks in the real world (some
kind of pattern is much more common).
When this merging process finds 1 TID that might really be deletable
then it's probably going to find much more than 1 -- better to make
that cache miss take care of all of the TIDs together. Also seems like
the CPU could do some clever prefetching with this approach -- it
could prefetch TIDs where the initial chunk metadata is insufficient
to eliminate them early -- these are the groups of TIDs that will have
many TIDs that we actually need to delete. ISTM that improving
temporal locality through batching could matter a lot here.
That's a promising approach.
In rtbm, one pair of a hash entry and a container is used per
block. Therefore, if a block contains no dead tuples, we can skip the
leaf page's TIDs for that block just by checking the hash table. If the
hash entry exists, meaning the block has at least one dead tuple, we
look up the offset of each leaf-page TID in that block's container.
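In pseudo-C, the batch check could look roughly like this; the rtbm
structure and the container-probe helper are hypothetical names, not the
actual patch API:

#include "postgres.h"
#include "storage/itemptr.h"
#include "utils/hsearch.h"

/* Hypothetical helper: probe a block's array/bitmap/run container. */
extern bool rtbm_container_contains(void *entry, OffsetNumber off);

/* Count the leaf-page TIDs that are in the dead-tuple set: one dynahash
 * lookup per TID decides whether a container probe is needed at all, so
 * heap blocks without dead tuples are skipped immediately. */
static int
count_deletable(HTAB *block_table, ItemPointerData *leaftids, int ntids)
{
	int			ndeletable = 0;

	for (int i = 0; i < ntids; i++)
	{
		BlockNumber blkno = ItemPointerGetBlockNumber(&leaftids[i]);
		void	   *entry = hash_search(block_table, &blkno,
										HASH_FIND, NULL);

		if (entry == NULL)
			continue;			/* no dead tuples on this heap block */

		if (rtbm_container_contains(entry,
									ItemPointerGetOffsetNumber(&leaftids[i])))
			ndeletable++;
	}
	return ndeletable;
}

If the leaf-page TIDs are sorted first, the hash lookup could also be
reused across consecutive TIDs that fall on the same block.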
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Jul 8, 2021 at 10:40 PM Hannu Krosing <hannuk@google.com> wrote:
Very nice results.
I have been working on the same problem but a bit different solution -
a mix of binary search for (sub)pages and 32-bit bitmaps for
tid-in-page.
Even with current allocation heuristics (allocate 291 tids per page)
it initially allocates much less space: instead of the current 291*6=1746
bytes per page it needs to allocate only 80 bytes.
Also it can be laid out so that it is friendly to parallel SIMD
searches doing up to 8 tid lookups in parallel.
Interesting.
That said, for allocating the tid array, the best solution is to
postpone it as much as possible and to do the initial collection into
a file, which
1) postpones the memory allocation to the beginning of index cleanup
2) lets you select the correct size and structure, as you know more
about the distribution at that time
3) lets you do the first heap pass in one go and then advance frozenxmin
*before* index cleanup
I think we have to do index vacuuming before heap vacuuming (2nd heap
pass). So do you mean that it advances relfrozenxid of pg_class before
both index vacuuming and heap vacuuming?
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
Greetings,
Andres Freund
Hi,
On 2021-07-09 10:17:49 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
I experimented further, trying to use an old radix tree implementation I
had lying around to store dead tuples. With a bit of trickery that seems
to work well.
The radix tree implementation I have basically maps an int64 to another
int64. Each level of the radix tree stores 6 bits of the key, and uses
those 6 bits to index a 1<<6 (64-entry) array leading to the next level.
My first idea was to use itemptr_encode() to convert tids into an int64
and store the lower 6 bits in the value part of the radix tree. That
turned out to work well performance-wise, but awfully memory-wise. The
problem is that we use at most 9 bits for offsets, but reserve 16 bits
for them in the ItemPointerData. Which means that there's often a
lot of empty "tree levels" for those 0 bits, making it hard to get to a
decent memory usage.
The simplest way to address that was to simply compress out those
guaranteed-to-be-zero bits. That results in memory usage that's quite
good - nearly always beating array, occasionally beating rtbm. It's an
ordered datastructure, so the latter isn't too surprising. For lookup
performance the radix approach is commonly among the best, if not the
best.
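Concretely, the compressed encoding amounts to something like this (a
sketch; the 9-bit constant assumes 8kB pages, where MaxHeapTuplesPerPage
is 291 < 2^9):

#include "postgres.h"
#include "storage/itemptr.h"

#define TID_OFFSET_BITS 9		/* enough for offsets up to 511 */

/* Unlike itemptr_encode(), which leaves 7 always-zero bits between the
 * block number and the 16-bit offset field, this packs the key densely
 * so the radix tree has no nearly-empty levels. */
static inline uint64
tid_to_radix_key(ItemPointer tid)
{
	return ((uint64) ItemPointerGetBlockNumber(tid) << TID_OFFSET_BITS) |
		ItemPointerGetOffsetNumber(tid);
}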
A variation of the storage approach is to just use the block number as
the index, and store the tids as the value. Even with the absolutely
naive approach of just using a Bitmapset, that reduces memory usage
substantially - at a small cost to search performance. Of course it'd be
better to use an adaptive approach like you did for rtbm; I just thought
this is good enough.
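A sketch of that naive variant, with radix_lookup()/radix_insert()
standing in for the (not shown) radix tree API keyed by block number:

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "storage/itemptr.h"

/* Hypothetical radix tree API: maps a block number to a Bitmapset. */
extern Bitmapset *radix_lookup(BlockNumber blkno);
extern void radix_insert(BlockNumber blkno, Bitmapset *offsets);

static void
dead_tuple_add(ItemPointer tid)
{
	BlockNumber blkno = ItemPointerGetBlockNumber(tid);
	Bitmapset  *offsets = radix_lookup(blkno);

	/* bms_add_member() allocates or extends the set as needed */
	offsets = bms_add_member(offsets, ItemPointerGetOffsetNumber(tid));
	radix_insert(blkno, offsets);
}

static bool
dead_tuple_exists(ItemPointer tid)
{
	Bitmapset  *offsets = radix_lookup(ItemPointerGetBlockNumber(tid));

	return offsets != NULL &&
		bms_is_member(ItemPointerGetOffsetNumber(tid), offsets);
}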
This largely works well, except when there are a large number of evenly
spread out dead tuples. I don't think that's a particularly common
situation, but it's worth considering anyway.
The reason the memory usage can be larger for sparse workloads is that
sparseness obviously can lead to tree nodes with only one child. As nodes
are quite large (1<<6 pointers to further children), that then can lead
to a large increase in memory usage.
I have toyed with implementing adaptively large radix nodes like
proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
gotten it quite working.
Greetings,
Andres Freund
On Sat, Jul 10, 2021 at 2:17 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
Yes, I'm thinking that iteration over rtbm would be somewhat similar to
tbm. That is, we iterate over and collect the hash table entries, qsort
them by block number, and then fetch each entry along with its
container, one by one, in order of block number.
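A rough sketch of that, with RtbmEntry as a hypothetical stand-in for
the real hash entry type:

#include "postgres.h"
#include "storage/block.h"

typedef struct RtbmEntry
{
	BlockNumber blkno;
	/* ... reference to the block's container ... */
} RtbmEntry;

static int
rtbm_entry_cmp(const void *a, const void *b)
{
	BlockNumber ba = ((const RtbmEntry *) a)->blkno;
	BlockNumber bb = ((const RtbmEntry *) b)->blkno;

	return (ba > bb) - (ba < bb);
}

/* One-time sort when iteration begins; afterwards the second heap pass
 * is a plain array walk that visits heap blocks in ascending order. */
static void
rtbm_begin_iterate(RtbmEntry *entries, int nentries)
{
	qsort(entries, nentries, sizeof(RtbmEntry), rtbm_entry_cmp);
}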
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/