[WIP] [B-Tree] Keep indexes sorted by heap physical location

Started by Claudio Freire, over 9 years ago · 34 messages · pgsql-hackers
#1Claudio Freire
klaussfreire@gmail.com

The attached patch tries to maintain the initial status of B-Tree
indexes, which are created with equal-key runs in physical order,
during the whole life of the B-Tree, and make key-tid pairs
efficiently searchable in the process.

The patch doesn't yet add an interface to perform key-tid searches, nor
does it address all potential issues that could arise from the
implementation, or backward compatibility with earlier-format indexes,
but it passes all tests, and seems to work well with the workloads I
have tried until now, which include:

- dump/restore/compare (with lots of indexes, both unique and non)
- sanity check of index lookups
- running a complex application on top of it, with no crashes or
corruption detected

Recovery hasn't been tested at all, and I'd expect it to have issues;
though I have tried to make the necessary changes, I may have missed
more than one. Nor have I tested high-concurrency workloads (although
the application mentioned does produce some), so this is an early WIP
in search of feedback on the overall approach. Still, it should be
enough to measure the merits of the idea, namely index size,
performance and code impact.

A couple of points make me uneasy about this patch, yet I can think of
no better alternative, so I seek feedback:

- introducing a different format for inner index tuples makes for an
invasive patch and quite difficult-to-reason code (it's easy to forget
whether a page is leaf or inner and that now causes assertion failures
or segfaults)
- the ascent-descent to check for uniqueness when there are large
dead tuple runs could have locking subtleties that escape me. Perhaps
there are better ways to solve this.
- the patch changes the output of regression tests in
nondeterministic ways, making index scan results always follow
physical order. While that's not incorrect, and in fact desirable,
lots of tests depended on the almost deterministic order that resulted
from the way a suitable insertion point was selected before. The first
patch addresses that, but at the expense of some "coverage" (ignoring
some lines of reference output, the minimum I could ignore).
- while I don't see why it would cause more bloat under normal usage
than the previous method, there are probably new/different
pathological cases to account for, and maybe I'm missing one that is a
common pattern
- it's btree, it doesn't get any more critical than that, it really
needs a heapload of testing before going in, and some of it (crash
recovery) may be tough to get right

On the performance side, I don't see a significant impact. If
anything, it has the potential to speed up things, since scans over
long runs of equal keys will now be done in physical order, and
insertions will be evenly spread around the index leaf pages (due to
the spread of heap tuples), but I haven't done any benchmarks beyond
pgbench to back that up.

It is also an important enabling feature to make vacuum more
efficient, and to implement WARM tuples in a more general way, so that
has to be accounted for when evaluating the potential benefit from
this patch.

So, the numbers.

I measured the size impact on a dump of a test database, and it seems
to be under the noise, which is what I expected since it only affects
non-leaf pages:

patched

db: 23G
hourly_full: 562M + 790M (118M + 118M + 554M)
bid_logs: 591M + 268M (134M + 134M)
uas: 285M + 496M (20M + 138M + 159M + 20M + 159M)

unpatched

db: 23G
hourly_full: 562M + 789M (118M + 118M + 553M)
bid_logs: 591M + 268M (134M + 134M)
uas: 285M + 496M (20M + 138M + 159M + 20M + 159M)

Here, numbers are HEAP + INDEXES (i1 + i2 + i3 ...), for tables that
are a mix of wide and narrow, with indexed columns both wide (varchar)
and narrow (int). Only a few indexes seem to have grown, but this is
before any bloat accumulates (I haven't yet devised a test that
produces the same amount of bloat on both databases to make them
comparable).

pgbench, in-memory, patched:

transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 13271118
latency average: 0.181 ms
tps = 44237.026822 (including connections establishing)
tps = 44237.227678 (excluding connections establishing)

pgbench, in-memory, unpatched:

transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 13336128
latency average: 0.180 ms
tps = 44453.718954 (including connections establishing)
tps = 44453.895436 (excluding connections establishing)

pgbench, bigger-than-RAM, patched:

transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 71194
latency average: 33.711 ms
tps = 237.269144 (including connections establishing)
tps = 237.270122 (excluding connections establishing)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 26748
latency average: 89.726 ms
tps = 89.043800 (including connections establishing)
tps = 89.044218 (excluding connections establishing)

pgbench, bigger-than-RAM, unpatched:

transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 86495
latency average: 27.747 ms
tps = 288.256959 (including connections establishing)
tps = 288.258202 (excluding connections establishing)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 27145
latency average: 88.414 ms
tps = 90.461600 (including connections establishing)
tps = 90.462105 (excluding connections establishing)

Which is expected since pgbench exercises none of the potentially
improved paths. I/O here (just a laptop) is horrible so the difference
is probably just noise. The drop in select-only seems strangely high;
I think that's also noise, but I haven't run the tests again yet.

Attachments:

0001-pg_regress-allow-ignoring-some-lines.patch (text/x-patch; +70 −5)
0002-B-Tree-Keep-indexes-sorted-by-physical-location.patch (text/x-patch; +680 −294)
#2Robert Haas
robertmhaas@gmail.com
In reply to: Claudio Freire (#1)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Wed, Aug 17, 2016 at 10:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

The attached patch tries to maintain the initial status of B-Tree
indexes, which are created with equal-key runs in physical order,
during the whole life of the B-Tree, and make key-tid pairs
efficiently searchable in the process.

I have two pretty important questions that don't seem to be
addressed in your email.

1. How does it do that?

2. What's the benefit?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Claudio Freire
klaussfreire@gmail.com
In reply to: Robert Haas (#2)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Thu, Aug 18, 2016 at 5:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 17, 2016 at 10:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

The attached patch tries to maintain the initial status of B-Tree
indexes, which are created with equal-key runs in physical order,
during the whole life of the B-Tree, and make key-tid pairs
efficiently searchable in the process.

I have two pretty important questions that don't seem to be
addressed in your email.

I believe I addressed that in the README, but if I wasn't clear, let
me know and I'll expand there too.

Summarizing here:

1. How does it do that?

It adds the block number and the offset number of the index tuple's
t_tid (that is, the pointer into the heap) as a final (implicit) sort
key, as is done already in tuplesort. It just keeps doing that while
comparing keys in the B-Tree when searching the leaf page for
insertion, so the insertion point now points to the exact point where
the insertion has to happen to maintain that condition, and not just
to the first valid position within the same-key run.
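
In other words, equal keys are tie-broken by physical heap location. A minimal sketch of that ordering, with made-up types and names rather than the patch's actual code:

```c
#include <stdint.h>

/* Hypothetical sketch: a comparator that falls back to the heap
 * pointer (block, offset) as a final, implicit sort key, the way
 * tuplesort already does when building an index. */
typedef struct DemoItemPointer {
    uint32_t block;   /* heap block number */
    uint16_t offset;  /* line pointer within the block */
} DemoItemPointer;

typedef struct DemoIndexEntry {
    int key;                /* the user-visible index key */
    DemoItemPointer t_tid;  /* pointer into the heap */
} DemoIndexEntry;

/* Returns <0, 0, >0 like memcmp.  Equal keys are ordered by physical
 * heap location, so a binary search for (key, tid) lands on the exact
 * slot that keeps runs of duplicates in heap order. */
int demo_compare(const DemoIndexEntry *a, const DemoIndexEntry *b)
{
    if (a->key != b->key)
        return (a->key < b->key) ? -1 : 1;
    if (a->t_tid.block != b->t_tid.block)
        return (a->t_tid.block < b->t_tid.block) ? -1 : 1;
    if (a->t_tid.offset != b->t_tid.offset)
        return (a->t_tid.offset < b->t_tid.offset) ? -1 : 1;
    return 0;
}
```

With such a comparator the insertion descent needs no scan along the duplicate run; the comparison itself pins down the slot.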

In fact, that's why non-leaf index tuples need a different format:
while leaf index tuples already contain the heap pointer, non-leaf
ones contain only the downlink, not the pointer into the heap. To be
able to do comparisons and pick the right downlink, the original heap
pointer in the leaf index tuple is copied, when splitting pages, into
an additional IndexTupleData header that is prepended only to non-leaf
index tuples.

This representation is chosen to minimize code changes, such that
(itup+1) is usable as before, and also because it's reasonably
compact. But it *is* necessary to check whether it's a leaf or
non-leaf tuple to choose whether to use (itup+1) or just itup, which
is why the patch requires so many changes in so many places.

2. What's the benefit?

The idea came up in the discussion about WARM updates.

Basically, in various situations it is convenient to be able to find
the index tuples that point to a specific heap tuple. Without this,
doing so isn't efficient - the only way to find such index tuples is
to find the leftmost index tuple that has the same key, and walk the
index right links until the right tid is found. For non-unique
indexes, but also for unique indexes with heavy bloat, this could
involve a large amount of I/O, so it isn't efficient in several
contexts. Think vacuum and WARM updates mostly (like HOT updates, but
where only some indexes don't need updating, and some do).

Having the ability to find a heap tuple by key-ctid pair is thus
beneficial because it allows efficient implementations for those
operations. It may benefit other parts of the code, but those are the
ones that come to mind.

It also makes index scans of the form

SELECT * FROM t WHERE some_col = some_const;

scan the heap in sequential order, even if some_col has low
cardinality (ie: the query returns lots of tuples), which is a nice
performance side effect.


#4Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Claudio Freire (#3)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Thu, Aug 18, 2016 at 3:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

It also makes index scans of the form

SELECT * FROM t WHERE some_col = some_const;

Scan the heap in sequential order, even if some_col has low
cardinality (ie: the query returns lots of tuples), which is a nice
performance side effect.

Speaking of performance side effects, does this avoid O(N^2)
performance on index tuple insertion with duplicate values, for all
insertion orderings? For example, does it descend directly to the
right leaf page for the insert rather than starting at the front of
the block of duplicate values and scanning to the right for a
block with space, with a random chance to split a full block on
each page it moves through?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#5Claudio Freire
klaussfreire@gmail.com
In reply to: Kevin Grittner (#4)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Thu, Aug 18, 2016 at 6:04 PM, Kevin Grittner <kgrittn@gmail.com> wrote:

On Thu, Aug 18, 2016 at 3:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

It also makes index scans of the form

SELECT * FROM t WHERE some_col = some_const;

Scan the heap in sequential order, even if some_col has low
cardinality (ie: the query returns lots of tuples), which is a nice
performance side effect.

Speaking of performance side effects, does this avoid O(N^2)
performance on index tuple insertion with duplicate values, for all
insertion orderings? For example, does it descend directly to the
right leaf page for the insert rather than starting at the front of
the block of duplicate values and scanning to the right for a
block with space, with a random chance to split a full block on
each page it moves through?

Yes, but only on non-unique indexes.

Unique indexes still need to scan all duplicates to check visibility
and will become O(N^2) there.

The code with the random chance to split is still there, but it should
never have a chance to run because the comparison against the heap
tuple pointer would stop it very quickly (before walking a single
offset I believe).

I thought about cleaning that up, but to make review easier I thought
I'd leave it there for now. Removing it would add a lot of diff noise.


#6Peter Geoghegan
pg@heroku.com
In reply to: Claudio Freire (#3)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

In fact, that's why non-leaf index tuples need a different format,
because while leaf index tuples contain the heap pointer already,
non-leaf ones contain only the downlink, not the pointer into the
heap. To be able to do comparisons and pick the right downlink, the
original heap pointer in the leaf index tuple is copied into the
downlink index tuple when splitting pages into an additional
IndexTupleData header that is prepended only to non-leaf index tuples.

I think that this is a bad idea. We need to implement suffix
truncation of internal page index tuples at some point, to make them
contain less information from the original leaf page index tuple.
That's an important optimization, because it increases fan-in. This
seems like a move in the opposite direction.

ISTM that the way to address this problem is with a duplicate list
and/or prefix compression in leaf pages.

--
Peter Geoghegan


#7Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Claudio Freire (#5)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

Claudio Freire wrote:

Unique indexes still need to scan all duplicates to check visibility
and will become O(N^2) there.

That scenario doesn't matter, because on unique indexes there aren't
many duplicate values anyway -- only one can be a live tuple.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#8Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#6)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Thu, Aug 18, 2016 at 6:15 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

In fact, that's why non-leaf index tuples need a different format,
because while leaf index tuples contain the heap pointer already,
non-leaf ones contain only the downlink, not the pointer into the
heap. To be able to do comparisons and pick the right downlink, the
original heap pointer in the leaf index tuple is copied into the
downlink index tuple when splitting pages into an additional
IndexTupleData header that is prepended only to non-leaf index tuples.

I think that this is a bad idea. We need to implement suffix
truncation of internal page index tuples at some point, to make them
contain less information from the original leaf page index tuple.
That's an important optimization, because it increases fan-in. This
seems like a move in the opposite direction.

I see that. I could try to measure average depth to measure the impact
this had on fan-in.

While it should cut it in half for narrow indexes, half of very high
is still high. Wide indexes, which are the ones that would suffer
from poor fan-in, would feel this far less, since the overhead is
constant.

Even if it does have an impact, I don't see an alternative, without
also implementing suffix truncation. Perhaps I could try to avoid
adding the leaf tid header if it isn't necessary, though I would have
to use up the last available flag bit in t_info for that.

ISTM that the way to address this problem is with a duplicate list
and/or prefix compression in leaf pages.

Prefix compression is another one I will be looking into eventually,
but last time I tried it, it was far more invasive, so I abandoned it until I
could find a less invasive way to do it.


#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Claudio Freire (#5)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

Claudio Freire <klaussfreire@gmail.com> writes:

On Thu, Aug 18, 2016 at 6:04 PM, Kevin Grittner <kgrittn@gmail.com> wrote:

Speaking of performance side effects, does this avoid O(N^2)
performance on index tuple insertion with duplicate values, for all
insertion orderings? For example, does it descend directly to the
right leaf page for the insert rather than starting at the front of
the block of duplicate values and scanning to the right for a
block with space, with a random chance to split a full block on
each page it moves through?

Yes, but only on non-unique indexes.

How's that work if the existing entries aren't in TID order (which they
will not be, in a pre-existing index)? Or are you assuming you can blow
off on-disk compatibility of indexes?

regards, tom lane


#10Claudio Freire
klaussfreire@gmail.com
In reply to: Tom Lane (#9)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Thu, Aug 18, 2016 at 6:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Claudio Freire <klaussfreire@gmail.com> writes:

On Thu, Aug 18, 2016 at 6:04 PM, Kevin Grittner <kgrittn@gmail.com> wrote:

Speaking of performance side effects, does this avoid O(N^2)
performance on index tuple insertion with duplicate values, for all
insertion orderings? For example, does it descend directly to the
right leaf page for the insert rather than starting at the front of
the block of duplicate values and scanning to the right for a
block with space, with a random chance to split a full block on
each page it moves through?

Yes, but only on non-unique indexes.

How's that work if the existing entries aren't in TID order (which they
will not be, in a pre-existing index)? Or are you assuming you can blow
off on-disk compatibility of indexes?

regards, tom lane

This patch does blow off on-disk compatibility, but the plan is to
re-add it later on.

A flag in the meta page would indicate whether it's a sorted index or
not, and the only way to get a sorted index would be with a reindex.
The code would have to support both for a while until whatever point
we'd figure we can drop support for old format indexes.

Since this is a sort order change I see no alternative, either the
whole index is sorted by TID or not.


#11Peter Geoghegan
pg@heroku.com
In reply to: Claudio Freire (#8)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Thu, Aug 18, 2016 at 2:26 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

I see that. I could try to measure average depth to measure the impact
this had on fan-in.

While it should cut it in half for narrow indexes, half of very high
is still high. Wide indexes, which are the ones that would suffer
from poor fan-in, would feel this far less, since the overhead is
constant.

You'll also have to consider the impact on the choice of split point,
FWIW. There is currently leeway about what page an index tuple can
land on, if and when it happens to be equal to the high-key. I'm not
sure how important that is, but it's a consideration.

Even if it does have an impact, I don't see an alternative, without
also implementing suffix truncation. Perhaps I could try to avoid
adding the leaf tid header if it isn't necessary, though I would have
to use up the last available flag bit in t_info for that.

To be clear, I'm particularly worried about painting ourselves into a
corner, suffix-truncation-wise. It's a very common optimization, that
we should really have.

ISTM that the way to address this problem is with a duplicate list
and/or prefix compression in leaf pages.

Prefix compression is another one I will be looking into eventually,
but last time I tried it, it was far more invasive, so I abandoned it until I
could find a less invasive way to do it.

Unfortunately, sometimes it is necessary to be very ambitious in order
to solve a problem. The understandable and usually well-reasoned
approach of making progress as incremental as possible occasionally
works against contributors. It's worth considering if this is such a
case.

--
Peter Geoghegan


#12Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#11)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Thu, Aug 18, 2016 at 6:38 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Aug 18, 2016 at 2:26 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

I see that. I could try to measure average depth to measure the impact
this had on fan-in.

While it should cut it in half for narrow indexes, half of very high
is still high. Wide indexes, which are are the ones that would suffer
from poor fan-in, would feel this far less, since the overhead is
constant.

You'll also have to consider the impact on the choice of split point,
FWIW. There is currently leeway about what page an index tuple can
land on, if and when it happens to be equal to the high-key. I'm not
sure how important that is, but it's a consideration.

When I read the code, it seemed generic enough, using ItemIdGetLength
(which works on non-leaf index tuples correctly, reporting the right
size), so it should still work.

But the choice of split point may make a difference for future
insertions, so I'll look into that.

Even if it does have an impact, I don't see an alternative, without
also implementing suffix truncation. Perhaps I could try to avoid
adding the leaf tid header if it isn't necessary, though I would have
to use up the last available flag bit in t_info for that.

To be clear, I'm particularly worried about painting ourselves into a
corner, suffix-truncation-wise. It's a very common optimization, that
we should really have.

Well, page version numbers could be used to switch between
semantically similar page formats later on. So if another format is
necessary (say, prefix compression, suffix truncation, etc), it can be
changed on-the-fly on existing indexes by checking page version
numbers and doing the proper switch.

So we won't be painting ourselves into a corner, but picking the wrong
format may indeed be a headache.

I can go the way of the flag in t_info if that's preferable. Both
would work. It's just that it's the last flag available, and that
would be painting ourselves into a corner regarding flags. To avoid
that, the flag itself could be "has extra data" (instead of "has leaf
tid"), and a whole set of flags could be added in the extra data. That
could also help for prefix compression and suffix truncation.
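
A sketch of that layout idea (all names and bit values here are invented for illustration, not PostgreSQL's actual t_info bits): one bit says "has aux data", and the aux area starts with its own flags word with room to grow for future features:

```c
#include <stdint.h>

/* Hypothetical flag layout: instead of burning the last free t_info
 * bit on "has leaf tid", spend it on "has aux data" and keep a richer
 * flags word inside the aux area, leaving room for suffix truncation,
 * prefix compression, and so on. */
#define DEMO_HAS_AUX       0x2000   /* imagined last free t_info bit */
#define DEMO_AUX_HEAPTID   0x0001   /* aux area carries a heap tid */
#define DEMO_AUX_TRUNCATED 0x0002   /* aux area says suffix was truncated */

typedef struct DemoTupleHeader {
    uint16_t t_info;     /* flags + size, as in IndexTupleData */
    uint16_t aux_flags;  /* valid only when DEMO_HAS_AUX is set */
} DemoTupleHeader;

/* True when the tuple carries the optional aux area at all. */
int demo_has_aux(const DemoTupleHeader *h)
{
    return (h->t_info & DEMO_HAS_AUX) != 0;
}

/* True when the aux area, if present, carries a heap tid. */
int demo_aux_has_heaptid(const DemoTupleHeader *h)
{
    return demo_has_aux(h) && (h->aux_flags & DEMO_AUX_HEAPTID) != 0;
}
```

The point of the indirection is that tuples without any aux data pay nothing, and new aux flags don't consume further t_info bits.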

ISTM that the way to address this problem is with a duplicate list
and/or prefix compression in leaf pages.

Prefix compression is another one I will be looking into eventually,
but last time I tried it, it was far more invasive, so I abandoned it until I
could find a less invasive way to do it.

Unfortunately, sometimes it is necessary to be very ambitious in order
to solve a problem. The understandable and usually well-reasoned
approach of making progress as incremental as possible occasionally
works against contributors. It's worth considering if this is such a
case.

I'd agree if this regressed performance considerably without the other
improvements, so I guess this will depend on the fan-in measurements.


#13Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Claudio Freire (#12)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Thu, Aug 18, 2016 at 5:04 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

But the choice of split point may make a difference for future
insertions, so I'll look into that.

A database product I worked on a long time ago had an interesting
optimization for indexes that had multiple insertion points, each
of which was increasing. (Picture, for example 100 case types,
with case numbers inserted in sequence within each case type -- so
the first three felony cases would be CF000001, CF000002, and
CF000003; while the first three small claims cases would be
SC000001, SC000002, and SC000003.) That's not a terribly rare
usage pattern, in my experience. An index on such data is most
efficient if you split at the point of the index tuple being
inserted -- making it the last tuple on the left-hand page. I
don't know whether we might want to consider making that an option
somehow -- it just came to mind because of this discussion.
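
That heuristic could be sketched roughly as follows (entirely hypothetical code, not PostgreSQL's actual _bt_findsplitloc): when the tuple being inserted ends an ascending run at its insertion point, split exactly there so the new tuple becomes the last item on the left-hand page; otherwise fall back to an even split.

```c
/* Hypothetical split-point chooser for pages with multiple ascending
 * insertion points (e.g. CF000001..., SC000001... case numbers).
 * insert_off is the 0-based offset where the new tuple would go;
 * nitems is the current item count on the full page. */
int demo_choose_split_offset(int insert_off, int nitems, int ends_ascending_run)
{
    if (ends_ascending_run)
        return insert_off;  /* new tuple becomes last on the left page */
    return nitems / 2;      /* default: even split */
}
```

With many independent ascending sequences, splitting at the insertion point leaves left-hand pages nearly full instead of half full.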

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#14Claudio Freire
klaussfreire@gmail.com
In reply to: Claudio Freire (#8)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Thu, Aug 18, 2016 at 6:26 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

On Thu, Aug 18, 2016 at 6:15 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

In fact, that's why non-leaf index tuples need a different format,
because while leaf index tuples contain the heap pointer already,
non-leaf ones contain only the downlink, not the pointer into the
heap. To be able to do comparisons and pick the right downlink, the
original heap pointer in the leaf index tuple is copied into the
downlink index tuple when splitting pages into an additional
IndexTupleData header that is prepended only to non-leaf index tuples.

I think that this is a bad idea. We need to implement suffix
truncation of internal page index tuples at some point, to make them
contain less information from the original leaf page index tuple.
That's an important optimization, because it increases fan-in. This
seems like a move in the opposite direction.

I see that. I could try to measure average depth to measure the impact
this had on fan-in.

While it should cut it in half for narrow indexes, half of very high
is still high. Wide indexes, which are the ones that would suffer
from poor fan-in, would feel this far less, since the overhead is
constant.

Even if it does have an impact, I don't see an alternative, without
also implementing suffix truncation. Perhaps I could try to avoid
adding the leaf tid header if it isn't necessary, though I would have
to use up the last available flag bit in t_info for that.

So, I did some tests.

TL;DR version is, as expected, there is a big impact on narrow
indexes, but I don't believe it's something to be worried about.

**Begin LONG version** (skip to end long version if not interested)

Regardless of the fanout (or fanin, if you look at it that way), what
matters is tree depth, so I used pageinspect to check how depth of
indexes changed after patching.

This is all on pristine recently built indexes. I will try to measure
the same on bloated indexes later on, but the tests are painfully
slow.

I used the following query to inspect all user indexes, since
pageinspect failed to work for toast indexes for some reason (it
probably needed qualified relnames or something like that). Anyway
user indexes should be enough.

select level, count(*), pg_size_pretty(max(relpages) * 8192.0) as biggest
from pg_class, bt_metap(relname)
where relkind = 'i' and relnamespace = 2200 and relam = 403
group by level
order by level desc;

I tested it on both patched and unpatched versions of my test
database (roughly 23GB), pgbench scale 1000 (15GB) and pgbench scale
10000 (146GB).

I believe pgbench is a worst case example, because it only has very
narrow PK indexes, which are the ones that will suffer the most from
this patch.

The numbers:

mat, patched

level;count;biggest
3;18;"1369 MB"
2;60;"322 MB"
1;26;"808 kB"
0;50;"16 kB"

mat, unpatched

level;count;biggest
3;15;"1367 MB"
2;63;"334 MB"
1;26;"800 kB"
0;49;"16 kB"

pgbench, scale 1000, patched

level;count;biggest
3;1;"2145 MB"
1;2;"456 kB"

pgbench, scale 1000, unpatched

level;count;biggest
3;1;"2142 MB"
1;2;"456 kB"

pgbench, scale 10000, patched

level;count;biggest
3;1;"21 GB"
2;1;"2224 kB"
1;1;"240 kB"

pgbench, scale 10000, unpatched

level;count;biggest
3;1;"21 GB"
1;2;"2208 kB"

So, clearly some indexes are pushed to the next level, but it isn't a
widespread effect - it looks more like the threshold index size for
needing a root split is slightly pushed downwards.

It totally agrees with the math. The index tuple header is 8 bytes,
and the datums need to be maxaligned, so they will also be at least 8
bytes. That means the fanout B gets multiplied by 2/3 at worst (a
16-byte inner tuple grows to 24 bytes), but only for inner tuples, and
so, writing the depth ratio as a, you get

a * log_B(N/B) = log_(2B/3)(N/B)

which, for big enough N, gives

a = log(B) / log(2B/3) = 1 / (1 - log_B(3/2)) <= 3/2

Which means the depth of huge B-trees is increased by at most 50%, but
the effect only begins to be seen at level 2, which is what we have
here, and level 2 for such a narrow index seems to be about 2GB. So
we're talking about very big indexes already. If level 2 can hold 2GB
of index, level 3 can hold around 2GB x (8kB / 16B) = 1TB of index,
and I have yet to see (even in very big databases) a 1TB index on a
PK.

If anyone has such an index, yes, this patch will hurt performance by
25% (4 page reads instead of 3).

Bad, but since it only affects very extreme cases, doesn't seem so bad.

**End LONG version**

So, that said, I started mocking up a version of the patch that uses a
"has aux data" flag to signal another set of flags from which variable
size attributes can be read, which would enable implementation of
suffix truncation and prefix compression in the future with minimal
data format changes, and was surprised to find that it seems not only
more compact, but cleaner. So I will probably go that way.


#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Claudio Freire (#1)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:

A couple of points make me uneasy about this patch, yet I can think of
no better alternative, so I seek feedback:

- introducing a different format for inner index tuples makes for an
invasive patch and quite difficult-to-reason code (it's easy to forget
whether a page is leaf or inner and that now causes assertion failures
or segfaults)
- the ascent-descent to check for uniqueness when there are large
dead tuple runs could have locking subtleties that escape me. Perhaps
there's better ways to solve this.

I have tried to study this part of your patch and it seems somewhat
non-trivial and risky part of proposal.

+ } else {
+ /*
+ * We found the actual first item on another block, so we have to perform
+ * a two-step search - first half, until our write-locked buffer, then another
+ * starting from our write-locked buffer.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BT_WRITE);
+
+ buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
+ true, stack, BT_WRITE, NULL);
+
+ first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, NULL);
+
+ xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf,
+                          first_offset, itup_scankey,
+                          checkUnique, &is_unique, &speculativeToken);
+
+ _bt_relbuf(rel, nbuf);
+ }

The idea of the uniqueness check is that we hold the write lock on the
buffer/page on which we are trying to operate (we do take read locks
on the consecutive pages during this check). Here, in your changed
proposal, you have two buffers (one to which the search led you and
one buffer previous in the chain), and before checking uniqueness, on
one of the buffers you seem to have a write lock and on the other a
read lock. This seems to change the way the uniqueness check has
worked till now; can you explain how this works (can we safely assume
that all searches for the uniqueness check will lead to the same
buffer first)?

With this new mechanism, do we have two types of search interfaces,
where one would work for keys (as now) and the other for key-ctid, or
will it be just a single interface that works both ways? I think
either way there are pros and cons.

Also, in the thread below you are talking about using the last bit in
t_info; I want to bring to your notice that there is a patch of
mine [1] in which I have used it to avoid changing the on-disk
compatibility of hash indexes. I am not saying that we should not
plan it for some other use, but just to keep in mind that there are
other proposals to use it. We can decide the best way to proceed
once we are aware of all the proposals that want to use it.

[1] - /messages/by-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#16Claudio Freire
klaussfreire@gmail.com
In reply to: Amit Kapila (#15)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Sat, Aug 20, 2016 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:

A couple of points make me uneasy about this patch, yet I can think of
no better alternative, so I seek feedback:

- introducing a different format for inner index tuples makes for an
invasive patch and quite difficult-to-reason code (it's easy to forget
whether a page is leaf or inner and that now causes assertion failures
or segfaults)
- the ascent-descent to check for uniqueness when there are large
dead tuple runs could have locking subtleties that escape me. Perhaps
there's better ways to solve this.

I have tried to study this part of your patch and it seems somewhat
non-trivial and risky part of proposal.

+ } else {
+ /*
+ * We found the actual first item on another block, so we have to perform
+ * a two-step search - first half, until our write-locked buffer, then another
+ * starting from our write-locked buffer.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BT_WRITE);
+
+ buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
+ true, stack, BT_WRITE, NULL);
+
+ first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, NULL);
+
+ xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf,
+                          first_offset, itup_scankey,
+                          checkUnique, &is_unique, &speculativeToken);
+
+ _bt_relbuf(rel, nbuf);
+ }

The idea for uniqueness check is that we hold the write lock on
buffer/page on which we are trying to operate (we do take read locks
on the consecutive pages during this check). Here, in your changed
proposal, you have two buffers (one to which the search led you and
one buffer previous in the chain) and before checking uniqueness, on
one of the buffers you seem to have write lock and on other you have
read lock. This seems to change the way uniqueness check works till
now, can you explain how this works (can we safely assume that all
searches for uniqueness check will lead to the same buffer first).

All uniqueness checks will find the same "nbuf" (read-locked), which
is the first one that matches the key.

They could have different "buf" buffers (the write-locked ones). But
since read locks cannot be held while a write-lock is held, I believe
it doesn't matter. All uniqueness checks will try to read-lock nbuf
unless the uniqueness check for that insertion is done. When they do,
they will block until the other insertion is finished, and when that
happens, either the insert happened, or it returned an xid to wait for
and retry. If it succeeded, the check attempting the read-lock will
see the inserted tuple and return the xid of the earlier transaction
and wait on it. If it failed, no insert happened because there's a
visible tuple there, and the read-lock will succeed, find the visible
tuple, and the insert will fail too.

In essence, it is still the case, I believe, that only one insert can
succeed at a time, all the others will retry. It only happens that
with the patch, the contention will be between a read lock and a write
lock, and not two write locks, and there's a little less contention
(some more work can be done in parallel).

With this new mechanism, do we have two type of search interfaces
where one would work for keys (as now) and other for key-ctid or it
will be just a single interface which works both ways? I think either
way there are pros and cons.

The patch doesn't add another interface, but the idea is to add it.
The implementation will probably fall on a common function that can do
both, but some index AMs can't implement the key-ctid operation (hash
indexes), so there will have to be separate interfaces to properly
represent the separate capabilities (besides, other implementations
may well use completely different paths for the two operations).

Also, in the thread below you are talking about using the last bit in
t_info, I want to bring it to your notice that there is a patch of
mine [1] in which I have used it to avoid changing on-disk
compatibility of hash indexes. I am not saying that we should not
plan it for some other use, but just to keep in mind that there are
other proposals to use it. We can decide what is best way to proceed,
if we are aware of all the proposals that want to use it.

[1] - /messages/by-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

I wasn't aware of that particular thread, but there's another (the
WARM one) with a different proposed use of that bit.

IMHO, any use of the bit that doesn't leave room for further
extensions of the bit space is a bad idea, because it closes the door
on (easy) further flags being added. And the fact that there are
already several proposals to use that bit for different purposes only
reinforces that idea.

The idea I'm working on allows further extension, and, although it
uses more space, I believe it's a better way. But we'll discuss that
when I have the implementation I guess.

#17Claudio Freire
klaussfreire@gmail.com
In reply to: Claudio Freire (#16)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Sat, Aug 20, 2016 at 1:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote:

All uniqueness checks will try to read-lock nbuf
unless the uniqueness check for that insertion is done.

That should read "all uniqueness checks will try to read-lock the buf
unless the uniqueness check for that insertion is done."

(nbuf -> buf)

#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Claudio Freire (#16)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Sat, Aug 20, 2016 at 10:23 AM, Claudio Freire <klaussfreire@gmail.com> wrote:

On Sat, Aug 20, 2016 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:

A couple of points make me uneasy about this patch, yet I can think of
no better alternative, so I seek feedback:

- introducing a different format for inner index tuples makes for an
invasive patch and quite difficult-to-reason code (it's easy to forget
whether a page is leaf or inner and that now causes assertion failures
or segfaults)
- the ascent-descent to check for uniqueness when there are large
dead tuple runs could have locking subtleties that escape me. Perhaps
there's better ways to solve this.

I have tried to study this part of your patch and it seems somewhat
non-trivial and risky part of proposal.

+ } else {
+ /*
+ * We found the actual first item on another block, so we have to perform
+ * a two-step search - first half, until our write-locked buffer, then another
+ * starting from our write-locked buffer.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BT_WRITE);
+
+ buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
+ true, stack, BT_WRITE, NULL);
+
+ first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, NULL);
+
+ xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf,
+                          first_offset, itup_scankey,
+                          checkUnique, &is_unique, &speculativeToken);
+
+ _bt_relbuf(rel, nbuf);
+ }

The idea for uniqueness check is that we hold the write lock on
buffer/page on which we are trying to operate (we do take read locks
on the consecutive pages during this check). Here, in your changed
proposal, you have two buffers (one to which the search led you and
one buffer previous in the chain) and before checking uniqueness, on
one of the buffers you seem to have write lock and on other you have
read lock. This seems to change the way uniqueness check works till
now, can you explain how this works (can we safely assume that all
searches for uniqueness check will lead to the same buffer first).

All uniqueness checks will find the same "nbuf" (read-locked), which
is the first one that matches the key.

They could have different "buf" buffers (the write-locked) one. But
since read locks cannot be held while a write-lock is held, I believe
it doesn't matter. All uniqueness checks will try to read-lock nbuf
unless the uniqueness check for that insertion is done. When they do,
they will block until the other insertion is finished, and when that
happens, either the insert happened, or it returned an xid to wait for
and retry. If it succeeded, the check attempting the read-lock will
see the inserted tuple and return the xid of the earlier transaction
and wait on it. If it failed, no insert happened because there's a
visible tuple there, and the read-lock will succeed, find the visible
tuple, and the insert will fail too.

In essence, it is still the case, I believe, that only one insert can
succeed at a time, all the others will retry. It only happens that
with the patch, the contention will be between a read lock and a write
lock, and not two write locks, and there's a little less contention
(some more work can be done in parallel).

With this new mechanism, do we have two type of search interfaces
where one would work for keys (as now) and other for key-ctid or it
will be just a single interface which works both ways? I think either
way there are pros and cons.

The patch doesn't add another interface, but the idea is to add it.
The implementation will probably fall on a common function that can do
both

That makes sense, but this means there is a chance that the searches
could lead to different buffers in the case of uniqueness checks (the
search with key-ctid could lead to a different buffer than the search
with just the key). I am not clear whether we have a requirement for
doing this uniqueness check for the key-ctid search API, because as I
understand it, you want it mainly for vacuum and WARM tuples. Vacuum
won't add new tuples, so is this required for WARM tuples?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#19Claudio Freire
klaussfreire@gmail.com
In reply to: Amit Kapila (#18)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Sat, Aug 20, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

All uniqueness checks will find the same "nbuf" (read-locked), which
is the first one that matches the key.

They could have different "buf" buffers (the write-locked) one. But
since read locks cannot be held while a write-lock is held, I believe
it doesn't matter. All uniqueness checks will try to read-lock nbuf
unless the uniqueness check for that insertion is done. When they do,
they will block until the other insertion is finished, and when that
happens, either the insert happened, or it returned an xid to wait for
and retry. If it succeeded, the check attempting the read-lock will
see the inserted tuple and return the xid of the earlier transaction
and wait on it. If it failed, no insert happened because there's a
visible tuple there, and the read-lock will succeed, find the visible
tuple, and the insert will fail too.

In essence, it is still the case, I believe, that only one insert can
succeed at a time, all the others will retry. It only happens that
with the patch, the contention will be between a read lock and a write
lock, and not two write locks, and there's a little less contention
(some more work can be done in parallel).

With this new mechanism, do we have two type of search interfaces
where one would work for keys (as now) and other for key-ctid or it
will be just a single interface which works both ways? I think either
way there are pros and cons.

The patch doesn't add another interface, but the idea is to add it.
The implementation will probably fall on a common function that can do
both

That makes sense, but this means there is a chance that the searches
could lead to different buffers in case of uniqueness checks (the
search with key-ctid could lead to a different buffer than the search
with just key). I am not clear do we have requirement for doing this
uniqueness check for key-ctid search API, because as I understand you
want to do it mainly for vacuum and WARM tuples. Vacuum won't add new
tuples, so is this required for WARM tuples?

Well, I'm not really sure what exactly would need to be done when
doing the WARM conditional insert in the case of unique indexes, and
that's a strong motivation not to add the interface for inserts just
yet. Vacuum will only need to delete, and in the case of deletes the
operation would be quite straightforward.

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Claudio Freire (#19)
Re: [WIP] [B-Tree] Keep indexes sorted by heap physical location

On Sat, Aug 20, 2016 at 9:58 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

On Sat, Aug 20, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

That makes sense, but this means there is a chance that the searches
could lead to different buffers in case of uniqueness checks (the
search with key-ctid could lead to a different buffer than the search
with just key). I am not clear do we have requirement for doing this
uniqueness check for key-ctid search API, because as I understand you
want to do it mainly for vacuum and WARM tuples. Vacuum won't add new
tuples, so is this required for WARM tuples?

Well, I'm not realy sure what exactly would need to be done when doing
the WARM conditional insert in the case of unique indexes, and that's
a strong motivation to not add the interface for inserts just yet.
Vacuum will only need to delete, and in the case of deletes the
operation would be quite straight forward.

Right. I think if you initially modify the interface only for
deletes, that will reduce the complexity of the patch as well. We can
later enhance it to handle the WARM tuple case.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
