Processing btree walks as a batch to parallelize IO
$SUBJECT is still a very loosely formed idea, so forgive lack of detail or
things I've likely missed, but I wanted to get it out there to see if it
sounded at all intriguing to people.
Background: One of the big problems with non-local storage such as AWS EBS volumes or a SAN is that when a large database (really, the working set, where the working set includes reads) exceeds the size of the buffer cache (and page cache), the cost of random page reads hitting the underlying disk system dominates. This is because networked disks have an order of magnitude higher latency than a bunch of RAIDed SSDs (even more so with NVMe storage). In some of our experiments on Aurora I've seen a 10x difference versus pretty good physical hardware, and I'd assume RDS (since it's EBS-backed) is similar.
A specific area where this is particularly painful is btree index reads.
Walking the tree to leaf pages isn't naturally prefetchable, and so for
each level you pay the random page cost. Of course higher levels in the
tree will almost certainly exhibit emergent behavior such that they (just
by fact of the LRU caching) will be in the buffer cache, but for a large
index lower levels likely won't be.
If we squint a bit, insertions look a whole lot like reads as well, since we have to walk the tree to find the leaf insertion page for a new tuple. This is particularly true for indexes where the inserted data is roughly randomly distributed, like a uuid.
The read-for-lookups problem is harder to solve, but the cost as it relates
to table inserts is possibly more tractable. Tables typically have more
than one index to update, so the obvious approach is "let's just
parallelize the index insertions". Of course we know that's difficult given
the multi-process approach Postgres uses for parallelism.
Another approach that at first glance seems like it fits better into
Postgres (I'm not claiming it's easy or a small patch) would be to process
a batch of indexes at once. For example, if the index access methods were
extended to allow being given a list of indexes that need to be walked,
then the btree code could process each layer in the walk as a group --
issuing IO fetches for all of the first level blocks in the tree, then computing all of the next level blocks needed and issuing those IO requests all at once, and so on.
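To make that a bit more concrete, here's roughly the shape of the driver loop I have in mind -- purely a sketch, and everything named index_batch_* (along with the states[]/indexes[]/nindexes context) is hypothetical, standing in for a new AM entry point; PrefetchBuffer() is the existing bufmgr call:

    /*
     * Hypothetical sketch: descend several btree indexes one level at a
     * time, issuing prefetches for the whole "frontier" of blocks before
     * reading any of them.  index_batch_next_block() and
     * index_batch_step() do not exist; they stand in for an AM API that
     * exposes a single descent step at a time.
     */
    for (;;)
    {
        bool        all_done = true;

        /* Phase 1: ask each index which block it needs next; prefetch it. */
        for (int i = 0; i < nindexes; i++)
        {
            BlockNumber blkno;

            if (index_batch_next_block(states[i], &blkno))
            {
                PrefetchBuffer(indexes[i], MAIN_FORKNUM, blkno);
                all_done = false;
            }
        }

        if (all_done)
            break;

        /*
         * Phase 2: read the (hopefully already in-flight) blocks and move
         * each descent down one level, stopping once a leaf is reached.
         */
        for (int i = 0; i < nindexes; i++)
            index_batch_step(states[i]);
    }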
In some workloads we've been testing I believe such an approach could
plausibly improve table insert (and update) performance by multiple
hundreds of percent.
I don't have any code at the moment to show here, but I wanted to get the
idea out there to see if there were any immediate reactions or other
thoughts on the topic.
Thoughts?
James
On 4/9/21 7:33 PM, James Coleman wrote:
$SUBJECT is still a very loosely formed idea, so forgive lack of detail
or things I've likely missed, but I wanted to get it out there to see if
it sounded at all intriguing to people.

Background: One of the big problems with non-local storage such as AWS EBS volumes or a SAN is that when a large database (really, the working set, where the working set includes reads) exceeds the size of the buffer cache (and page cache), the cost of random page reads hitting the underlying disk system dominates. This is because networked disks have an order of magnitude higher latency than a bunch of RAIDed SSDs (even more so with NVMe storage). In some of our experiments on Aurora I've seen a 10x difference versus pretty good physical hardware, and I'd assume RDS (since it's EBS-backed) is similar.

A specific area where this is particularly painful is btree index reads.
Walking the tree to leaf pages isn't naturally prefetchable, and so for
each level you pay the random page cost. Of course higher levels in the
tree will almost certainly exhibit emergent behavior such that they
(just by fact of the LRU caching) will be in the buffer cache, but for a
large index lower levels likely won't be.
What do you consider a large index level?
Consider a 1TB table, with just a single UUID column - that's ~25B rows,
give or take. Real tables will have more columns, so this seems like a
reasonable model of the largest number of rows per relation. With ~32B
per index tuple, that's about 100M leaf pages, and with ~256 branches
per internal page, that's still only ~5 levels. I think it's quite rare
to see indexes with more than 6 or 7 levels.
And the internal pages are maybe 0.5% of the whole index (so ~4GB out of
750GB). I think the usual expectation is that most of that will fit into
RAM, but of course there may be more indexes competing for that.
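For what it's worth, a quick back-of-the-envelope version of that arithmetic (the constants -- 8kB pages, ~250 leaf tuples and ~256 downlinks per page -- are rough approximations, not exact nbtree numbers):

    #include <stdio.h>

    int
    main(void)
    {
        double      level_pages = 25e9 / 250;   /* ~25B tuples -> ~100M leaf pages */
        double      leaf_pages = level_pages;
        double      internal_pages = 0;
        int         levels = 1;                 /* the leaf level */

        while (level_pages > 1)
        {
            level_pages = level_pages / 256;    /* fan-out of internal pages */
            if (level_pages < 1)
                level_pages = 1;                /* the root */
            internal_pages += level_pages;
            levels++;
        }

        printf("levels: %d, internal: ~%.1f GB of ~%.0f GB total\n",
               levels, internal_pages * 8192 / 1e9, leaf_pages * 8192 / 1e9);
        return 0;
    }

That prints 5 levels and roughly 3GB of internal pages, which is the same ballpark.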
I think the number of index levels is not really the crucial bit - it's more about the total size of all the indexes in the DB.
If we squint a bit, insertions look a whole lot like reads as well since
we have to walk the tree to find the leaf insertion page for a new
tuple. This is particularly true for indexes where inserts are roughly
randomly distributed data, like a uuid.
Yep. We need to walk the index to the leaf pages in both cases, both for
read and insert workloads.
The read-for-lookups problem is harder to solve, but the cost as it
relates to table inserts is possibly more tractable. Tables typically
have more than one index to update, so the obvious approach is "let's
just parallelize the index insertions". Of course we know that's
difficult given the multi-process approach Postgres uses for parallelism.
Hmm. Not sure if reads are harder to deal with, but I think you're right
those two cases (reads and writes) may look similar at the level of a
single index, but may need rather different approaches exactly because
inserts have to deal with all indexes, while reads only really deal with
a single index.
FWIW I think there are a couple options for improving reads, at least in
some cases.
1) I wonder if e.g. _bt_readnextpage could prefetch at least one page ahead. We can't look further ahead, but perhaps this would help (rough sketch after (2) below).
2) In some cases (e.g. a nested loop with an inner index scan) we could collect an array of values and then look them up at once, which should allow us to do at least some of the I/O in parallel, I think. That's
similar to what you propose for writes, except that it works against the
same index.
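For (1), I'm thinking of something along these lines, right after the scan has read a page -- just a sketch that glosses over how _bt_readnextpage is actually structured, but the right-link and PrefetchBuffer() are real:

    /*
     * Sketch only: after reading the current page, peek at its right-link
     * and start fetching the sibling we're likely to need next.  The
     * variables (scan, so, dir, page, opaque) are the ones already used
     * in the surrounding scan code.
     */
    page = BufferGetPage(so->currPos.buf);
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);

    if (ScanDirectionIsForward(dir) && opaque->btpo_next != P_NONE)
        PrefetchBuffer(scan->indexRelation, MAIN_FORKNUM, opaque->btpo_next);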
Another approach that at first glance seems like it fits better into
Postgres (I'm not claiming it's easy or a small patch) would be to
process a batch of indexes at once. For example, if the index access
methods were extended to allow being given a list of indexes that need
to be walked, then the btree code could process each layer in the walk
as a group -- issuing IO fetches for all of the first level blocks in
the tree, and then computing all of the next level blocks needed and
issuing those IO requests all at once, and so on.
Yeah, I agree having a way to say "prefetch all pages needed to insert
these keys into these indexes" might be better than just parallelizing
it in a "naive" way.
Not sure how complex it would be - I think the API would need to allow
traversing the index with each step split into two phases:
1) determine the page needed for the next step, return it to caller
2) the caller collects pages from all indexes, initiates prefetch
3) instruct indexes to actually do the next step, stop if it's a leaf
page (otherwise go to (1))
And then we might just do index inserts in a serial way, just like we do
today, hoping to hit the prefetched pages.
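Something like this pair of callbacks might be enough to express those phases -- entirely hypothetical, nothing like it exists in IndexAmRoutine today, and IndexBatchState is made up as well:

    /*
     * Hypothetical AM additions:
     *
     * ambatchnextblock - report the block the next descent step would
     * read, without reading it, so the caller can prefetch across many
     * indexes at once.
     *
     * ambatchstep - actually perform that step; returns false once the
     * descent has reached a leaf page.
     */
    typedef bool (*ambatchnextblock_function) (IndexBatchState *state,
                                               BlockNumber *blkno);
    typedef bool (*ambatchstep_function) (IndexBatchState *state);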
FWIW while this probably helps saturate the I/O, it unfortunately does nothing to reduce the write amplification - we still need to modify the same number of leaf pages in all indexes, produce the same amount of WAL
etc. I think there were some proposals to add small internal buffers,
and instead of pushing the inserts all the way down to the leaf page,
just add them to the internal buffer. And when the buffer gets full,
propagate the contents to the next level of buffers.
For example, each internal page might have one "buffer" page, so the
index size would not really change (the internal pages would double, but
it's still just ~1% of the total index size). Of course, this makes
lookups more complex/expensive, because we need to check the internal
buffers. But it does reduce the write amplification, because it combines
changes to leaf pages.
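Purely to illustrate the shape of that idea (a hypothetical layout, not a proposal for the actual page format):

    /*
     * Hypothetical: give each internal page a link to a companion
     * "buffer" page that accumulates entries not yet pushed down to the
     * child level.  When the buffer fills up, its entries get pushed one
     * level down (into the children's buffers, or into leaf pages at the
     * bottom).  Lookups would have to check these buffers as well.
     */
    typedef struct BTBufferedPageOpaqueData
    {
        BTPageOpaqueData std;           /* existing nbtree special space */
        BlockNumber     buffer_blkno;   /* companion buffer page, or
                                         * InvalidBlockNumber if none */
    } BTBufferedPageOpaqueData;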
In some workloads we've been testing I believe such an approach could
plausibly improve table insert (and update) performance by multiple
hundreds of percent.

I don't have any code at the moment to show here, but I wanted to get the idea out there to see if there were any immediate reactions or other thoughts on the topic.

Thoughts?
I think you're right indexes may be a serious bottleneck in some cases,
so exploring ways to improve that seems useful. Ultimately I think we
should be looking for ways to reduce the amount of work we need to do,
but parallelizing it (i.e. doing the same amount of work but in multiple
processes) is a valid approach too.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Apr 9, 2021 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 4/9/21 7:33 PM, James Coleman wrote:
$SUBJECT is still a very loosely formed idea, so forgive lack of detail
or things I've likely missed, but I wanted to get it out there to see if
it sounded at all intriguing to people.

Background: One of the big problems with non-local storage such as AWS EBS volumes or a SAN is that when a large database (really, the working set, where the working set includes reads) exceeds the size of the buffer cache (and page cache), the cost of random page reads hitting the underlying disk system dominates. This is because networked disks have an order of magnitude higher latency than a bunch of RAIDed SSDs (even more so with NVMe storage). In some of our experiments on Aurora I've seen a 10x difference versus pretty good physical hardware, and I'd assume RDS (since it's EBS-backed) is similar.

A specific area where this is particularly painful is btree index reads.
Walking the tree to leaf pages isn't naturally prefetchable, and so for
each level you pay the random page cost. Of course higher levels in the
tree will almost certainly exhibit emergent behavior such that they
(just by fact of the LRU caching) will be in the buffer cache, but for a
large index lower levels likely won't be.

What do you consider a large index level?

In general it's probably all levels but the leaves (though it depends on cache and index size, etc.)
Consider a 1TB table, with just a single UUID column - that's ~25B rows,
give or take. Real tables will have more columns, so this seems like a
reasonable model of the largest number of rows per relation. With ~32B
per index tuple, that's about 100M leaf pages, and with ~256 branches
per internal page, that's still only ~5 levels. I think it's quite rare
to see indexes with more than 6 or 7 levels.

And the internal pages are maybe 0.5% of the whole index (so ~4GB out of 750GB). I think the usual expectation is that most of that will fit into RAM, but of course there may be more indexes competing for that.

I think the number of index levels is not really the crucial bit - it's more about the total size of all the indexes in the DB.

I suppose? If the size of the tables/indexes/etc. is sufficiently large relative to the cache size, the quantity won't matter.
If we squint a bit, insertions look a whole lot like reads as well since
we have to walk the tree to find the leaf insertion page for a new
tuple. This is particularly true for indexes where inserts are roughly
randomly distributed data, like a uuid.

Yep. We need to walk the index to the leaf pages in both cases, both for read and insert workloads.

The read-for-lookups problem is harder to solve, but the cost as it relates to table inserts is possibly more tractable. Tables typically have more than one index to update, so the obvious approach is "let's just parallelize the index insertions". Of course we know that's difficult given the multi-process approach Postgres uses for parallelism.

Hmm. Not sure if reads are harder to deal with, but I think you're right
those two cases (reads and writes) may look similar at the level of a
single index, but may need rather different approaches exactly because
inserts have to deal with all indexes, while reads only really deal with
a single index.
Right. In practice it's harder to deal with a single index scan
because you don't have multiple such scans to parallelize.
FWIW I think there are a couple options for improving reads, at least in
some cases.

1) I wonder if e.g. _bt_readnextpage could prefetch at least one page ahead. We can't look further ahead, but perhaps this would help.

2) In some cases (e.g. a nested loop with an inner index scan) we could collect an array of values and then look them up at once, which should allow us to do at least some of the I/O in parallel, I think. That's
similar to what you propose for writes, except that it works against the
same index.
The "collect an array of values" approach isn't one I'd considered, but it seems likely to be interesting.
Another approach that at first glance seems like it fits better into
Postgres (I'm not claiming it's easy or a small patch) would be to
process a batch of indexes at once. For example, if the index access
methods were extended to allow being given a list of indexes that need
to be walked, then the btree code could process each layer in the walk
as a group -- issuing IO fetches for all of the first level blocks in
the tree, and then computing all of the next level blocks needed and
issuing those IO requests all at once, and so on.

Yeah, I agree having a way to say "prefetch all pages needed to insert these keys into these indexes" might be better than just parallelizing it in a "naive" way.

Not sure how complex it would be - I think the API would need to allow traversing the index with each step split into two phases:

1) determine the page needed for the next step, return it to caller

2) the caller collects pages from all indexes, initiates prefetch

3) instruct indexes to actually do the next step, stop if it's a leaf page (otherwise go to (1))

And then we might just do index inserts in a serial way, just like we do
today, hoping to hit the prefetched pages.
Correct; this is roughly what I was envisioning.
FWIW while this probably helps saturate the I/O, it unfortunately does nothing to reduce the write amplification - we still need to modify the same number of leaf pages in all indexes, produce the same amount of WAL
etc. I think there were some proposals to add small internal buffers,
and instead of pushing the inserts all the way down to the leaf page,
just add them to the internal buffer. And when the buffer gets full,
propagate the contents to the next level of buffers.

For example, each internal page might have one "buffer" page, so the index size would not really change (the internal pages would double, but it's still just ~1% of the total index size). Of course, this makes
lookups more complex/expensive, because we need to check the internal
buffers. But it does reduce the write amplification, because it combines
changes to leaf pages.
I think I've seen that discussion, and it's very interesting, but I also think it's still orthogonal to this.
In some workloads we've been testing I believe such an approach could
plausibly improve table insert (and update) performance by multiple
hundreds of percent.

I don't have any code at the moment to show here, but I wanted to get the idea out there to see if there were any immediate reactions or other thoughts on the topic.

Thoughts?
I think you're right indexes may be a serious bottleneck in some cases,
so exploring ways to improve that seems useful. Ultimately I think we
should be looking for ways to reduce the amount of work we need to do,
but parallelizing it (i.e. doing the same amount of work but in multiple
processes) is a valid approach too.
Thanks for the feedback.
James
On Fri, 9 Apr 2021 at 16:58, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 4/9/21 7:33 PM, James Coleman wrote:
A specific area where this is particularly painful is btree index reads.
Walking the tree to leaf pages isn't naturally prefetchable, and so for
each level you pay the random page cost. Of course higher levels in the
tree will almost certainly exhibit emergent behavior such that they
(just by fact of the LRU caching) will be in the buffer cache, but for a
large index lower levels likely won't be.
We've talked before about buffering inserts even just for disk-based
indexes. Much like how GIN buffers inserts and periodically flushes
them out. We talked about doing a local buffer in each session since
no other session even needs to see these buffered inserts until commit
anyways. And we can more efficiently merge in multiple keys at once
than doing them one by one.
But that was just for disk i/o. For something with longer latency it would be an even bigger win. Buffer the inserted keys in local memory (in case you do lookups in this same session), and start the i/o to insert the rows into the index, but handle that in the background or in a separate process, without blocking the transaction until commit.
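Roughly, I'm picturing a per-backend staging area along these lines (the names are entirely made up):

    /*
     * Hypothetical per-backend staging of index insertions.  Entries are
     * kept in backend-local memory (and consulted by this session's own
     * index scans), then merged into the index in key order at commit, or
     * earlier by a background process once enough have accumulated.
     */
    typedef struct PendingIndexInsert
    {
        Oid             indexoid;                   /* target index */
        Datum           values[INDEX_MAX_KEYS];     /* key values */
        bool            isnull[INDEX_MAX_KEYS];     /* null flags */
        ItemPointerData htid;                       /* heap tuple it points at */
    } PendingIndexInsert;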
What do you consider a large index level?
Consider a 1TB table, with just a single UUID column - that's ~25B rows,
give or take. Real tables will have more columns, so this seems like a
reasonable model of the largest number of rows per relation. With ~32B
per index tuple, that's about 100M leaf pages, and with ~256 branches
per internal page, that's still only ~5 levels. I think it's quite rare
to see indexes with more than 6 or 7 levels.
That's a good model for a well-designed schema with an efficient
index. There are plenty of less-than-optimal schemas with indexes on
longer column lists or fairly large text fields....
--
greg
On Fri, May 7, 2021 at 3:34 PM Greg Stark <stark@mit.edu> wrote:
We've talked before about buffering inserts even just for disk-based
indexes. Much like how GIN buffers inserts and periodically flushes
them out. We talked about doing a local buffer in each session since
no other session even needs to see these buffered inserts until commit
anyways. And we can more efficiently merge in multiple keys at once
than doing them one by one.
Mark Callaghan's high level analysis of the trade-offs here is worth a
read, too.
That's a good model for a well-designed schema with an efficient
index. There are plenty of less-than-optimal schemas with indexes on
longer column lists or fairly large text fields....
Suffix truncation can take care of this -- all you really need is a
minimally distinguishing separator key to delineate which values
belong on which page one level down. It is almost always possible for
leaf page splits to find a way to make the new high key (also the key
to be inserted in the parent level) much smaller than your typical
key. Granted, we don't have what I've called "classic" suffix truncation (truncation within a text column) yet, so this analysis isn't going to work with long text keys (we currently only truncate at the attribute granularity).
Even if we're pessimistic about suffix truncation, the logarithmic
rate of growth still wins -- Tomas' analysis is sound. You cannot
realistically make a Postgres B-Tree have more than about 1% of all
pages as internal pages, unless you make the indexed keys ludicrously
large -- as in several hundred bytes each (~0.5% is typical in
practice). I think that 6 levels is very pessimistic, even with a
massive B-Tree with weirdly large keys. My mental model for internal
pages is that they are practically guaranteed to be in shared_buffers
at all times, which is about as accurate as any generalization like
that ever can be.
I once wrote a test harness that deliberately created a B-Tree that
was as tall as possible -- something with the largest possible index
tuples on the leaf level (had to disable TOAST for this). I think that
it was about 7 or 8 levels deep. The CPU overhead of the test case
made it excruciatingly slow, but it wasn't I/O bound at all (pretty
sure it all fitted in shared_buffers).
--
Peter Geoghegan