No heap lookups on index

Started by David Scottabout 20 years ago38 messageshackersgeneral
Jump to latest
#1David Scott
davids@apptechsys.com
hackersgeneral

Allow me a brief introduction. I work in a company who contracts
intelligence analysis software to the government. We are currently
developing a product which is using PostgreSQL at it's core. Due to the
licensing of the product and the integration with perl this is our first
choice in database solutions.

We are, however, currently stuck. We are storing millions of rows and
require very high query performance. We have spent the last several
months tweaking, list lurking and researching all the various tweaks and
performance enhancements and have come to the conclusion that our
biggest slowdown is validating the index rows which match our selection
criteria against the heap values. In general cases there is a very
small amount required for this, but in our extreme use cases we are
finding this to slow our queries by an unacceptable amount of time.

We would like to resolve this issue. In that endeavor we have done some
feasibility analysis (either to write a patch ourselves or attempt to
commission an expert to do so), starting with the archives for this
list. We found several posts discussing the issue and it seems that the
complexity of storing the tuple visibility information inside of the
index rows is prohibitive for simple indexes.

I have used SQL Server in the past and have noticed that bookmark
lookups are avoided because they force the query executor to actually
fetch the data page off of disk, rather then return the values that
exist in the index. I have verified times against the PostgreSQL
installation and SQL Server to verify that the SQL Server queries come
back at roughly the same speed when avoiding bookmark lookups as
Postgres queries accessing clustered tables using the index the table is
clustered on.

Since I am sure everyone is tired of the intro by now, I'll get to the
questions:
Do commercial databases implement MVCC in a way that allows an
efficient implementation of index lookups that can avoid heap lookups?
Is there any way to modify PostgreSQL to allow index lookups without
heap validation that doesn't involve re-writing the MVCC implementation
of keeping dead rows on the live table?
Is the additional overhead of keeping full tuple visibility
information inside of the index so odious to the Postgres community as
to prevent a patch with this solution from being applied back to the
head? Maybe as an optional use feature? We would prefer this solution
for our needs over the bitmap of heap pages listed in the TODO list
because we want to ensure optimal query times, regardless of the state
of the cache and because we are concerned with performance in the face
of concurrent updates on the page level.

Thanks for any thoughts on this, I know this is a perennial topic, but
we are seriously considering contributing either code or money to the
solution of this problem.

David Scott
Applied Technical Systems, Inc.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Scott (#1)
hackersgeneral
Re: No heap lookups on index

David Scott <davids@apptechsys.com> writes:

Is the additional overhead of keeping full tuple visibility
information inside of the index so odious to the Postgres community as
to prevent a patch with this solution from being applied back to the
head?

This has been discussed and rejected before (multiple times). If you
want it considered you'll have to present stronger arguments than have
so far been made. The current consensus is that the probability of a
net performance win is not good enough to justify the large amount of
development effort that would be required.

What sort of problems are you dealing with exactly? There has been
some discussion of changes that would improve certain scenarios. For
instance it might be plausible to do joins using index information and
only go back to the heap for entries that appear to pass the join test.

regards, tom lane

#3Jonah H. Harris
jonah.harris@gmail.com
In reply to: Tom Lane (#2)
hackersgeneral
Re: No heap lookups on index

David,

You can find some of this discussion in "Much Ado About COUNT(*)". Related
to that discussion, I had written a patch which added visibility information
to the indexes.

If you're interested in the patch and/or consulting, contact me offline.

-Jonah

Show quoted text

On 1/18/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Scott <davids@apptechsys.com> writes:

Is the additional overhead of keeping full tuple visibility
information inside of the index so odious to the Postgres community as
to prevent a patch with this solution from being applied back to the
head?

This has been discussed and rejected before (multiple times). If you
want it considered you'll have to present stronger arguments than have
so far been made. The current consensus is that the probability of a
net performance win is not good enough to justify the large amount of
development effort that would be required.

What sort of problems are you dealing with exactly? There has been
some discussion of changes that would improve certain scenarios. For
instance it might be plausible to do joins using index information and
only go back to the heap for entries that appear to pass the join test.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

#4Simon Riggs
simon@2ndQuadrant.com
In reply to: David Scott (#1)
hackersgeneral
Re: No heap lookups on index

On Wed, 2006-01-18 at 12:14 -0800, David Scott wrote:

Is the additional overhead of keeping full tuple visibility
information inside of the index so odious to the Postgres community
as
to prevent a patch with this solution from being applied back to the
head? Maybe as an optional use feature?

You might want to consider the thought of "organised heaps" as an
alternative thought to index improvements. That way there is no heap to
avoid visiting because the index is also the main data structure.

Teradata provides hash or value-ordered tables
Oracle offers index organised tables
DB2 offers multi-dimensional clustering
Tandem offered value ordered tables

This would offer performance, but would be one of the largest patches
seen in recent times. You may find some co-backers.

Best Regards, Simon Riggs

#5Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: David Scott (#1)
hackersgeneral
Re: No heap lookups on index

On Wed, Jan 18, 2006 at 12:14:12PM -0800, David Scott wrote:

Do commercial databases implement MVCC in a way that allows an
efficient implementation of index lookups that can avoid heap lookups?

Oracle does, but you pay in other ways. Instead of keeping dead tuples
in the main heap, they shuffle them off to an 'undo log'. This has some
downsides:

Rollbacks take *forever*, though this usually isn't much of an issue
unless you need to abort a really big transaction.

Every update/delete means two seperate writes to disk, one for the base
table and one for the undo log (well, there's also the redo log,
equivalent to our WAL). Though writes to undo can (and presumably are)
grouped together, so they should normally be a lot more efficient than
the updates to the base table unless you're updating data in table
order.

Of course there's downsides to our MVCC as well; the cost of index scans
is just one.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#6Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jonah H. Harris (#3)
hackersgeneral
Re: No heap lookups on index

On Wed, Jan 18, 2006 at 04:02:45PM -0500, Jonah H. Harris wrote:

David,

You can find some of this discussion in "Much Ado About COUNT(*)". Related
to that discussion, I had written a patch which added visibility information
to the indexes.

If you're interested in the patch and/or consulting, contact me offline.

Does the patch change all indexes across the board? Do you have any
performance numbers?

I suspect that in some situations storing visibility info in the index
would be a big win; if that's the case it would be very good if there
was an option that allowed it. Perhaps this could be done using a
different index access method...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#7Glen Parker
glenebob@nwlink.com
In reply to: Tom Lane (#2)
hackersgeneral
Re: [HACKERS] No heap lookups on index

Tom Lane wrote:

David Scott <davids@apptechsys.com> writes:

Is the additional overhead of keeping full tuple visibility
information inside of the index so odious to the Postgres community as
to prevent a patch with this solution from being applied back to the
head?

This has been discussed and rejected before (multiple times). If you
want it considered you'll have to present stronger arguments than have
so far been made. The current consensus is that the probability of a
net performance win is not good enough to justify the large amount of
development effort that would be required.

What ever happened to grouped heap reads, i.e. building a list of tuples
from the index, sorting in heap order, then reading the heap in a batch?
The last I remember (maybe two years ago), it was being discussed but
no design decisions had been made. Is that what you're talking about?

-Glen

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#4)
hackersgeneral
Re: No heap lookups on index

Simon Riggs <simon@2ndquadrant.com> writes:

You might want to consider the thought of "organised heaps" as an
alternative thought to index improvements. That way there is no heap to
avoid visiting because the index is also the main data structure.
This would offer performance, but would be one of the largest patches
seen in recent times. You may find some co-backers.

Either way it would be a pretty monstrous patch :-( ... in this case
because of the amount of code that knows about the properties of heap
storage, and in what David is thinking about because of the implications
of trying to keep multiple copies of tuple state up-to-date.

We'd probably end up with a cleaner system structure if we tried to
create an API separating out the knowledge of heap structure, but the
amount of work needed seems out of proportion to the benefit.

It might be possible to compromise though. Imagine an index that
contains only the upper levels of a search tree --- links to what
would be the leaf level point into the associated heap. In this design
the heap is still a heap in the sense that you can seqscan it without
any awareness of the index structure. What you can't do is insert
tuples or move them around without the index AM's say-so.
RelationGetBufferForTuple would become an index AM call, but otherwise
I think the impact on existing code wouldn't be large.

There are some limitations. For instance I don't think that the index
AM could control the order of items within a heap page, because of the
need for TIDs to be persistent; so within-page searches would still be
kinda slow. But it's interesting to think about.

regards, tom lane

#9Glen Parker
glenebob@nwlink.com
In reply to: David Scott (#1)
hackersgeneral
Re: [HACKERS] No heap lookups on index

Tom Lane wrote:

What ever happened to grouped heap reads, i.e. building a list of tuples
from the index, sorting in heap order, then reading the heap in a batch?

Done in 8.1. I'm uncertain whether Scott knows about that ...

That's GREAT news! Is that the "Bitmap Scan" item in the what's new
list (http://www.postgresql.org/docs/whatsnew)? I didn't even notice it
when I read it the first time. I'm really looking forward to our
upcoming 8.1 upgrade.

-Glen

#10Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#8)
hackersgeneral
Re: No heap lookups on index

On Wed, 2006-01-18 at 18:27 -0500, Tom Lane wrote:

Imagine an index that
contains only the upper levels of a search tree --- links to what
would be the leaf level point into the associated heap. In this
design
the heap is still a heap in the sense that you can seqscan it without
any awareness of the index structure. What you can't do is insert
tuples or move them around without the index AM's say-so.
RelationGetBufferForTuple would become an index AM call, but otherwise
I think the impact on existing code wouldn't be large.

Eureka! I had been thinking of a "block level index" which sounds almost
the same thing (as opposed to the row level indexes we have now). We
only need to index the row with the lowest value on any page so the main
index would get 100 times smaller. The main part of the index would not
need to be written to except when a block overflows.

I had imagined an ordering within a block to allow fast uniqueness
checks, but it would be pretty fast either way.

Merge joins with the same index become block-level joins without sorts.
We would just do an individual block sort before merging, so no need for
very large sort-merges. Even if the block level indexes differ, we only
need to sort one of the tables.

Hopefully we could avoid trying to support GIST-heaps?

Best Regards, Simon Riggs

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#10)
hackersgeneral
Re: No heap lookups on index

Simon Riggs <simon@2ndquadrant.com> writes:

Hopefully we could avoid trying to support GIST-heaps?

Well, that would be an extra index AM that someone might or might not
get around to writing someday. I was thinking that both btree and hash
index AMs might be interesting for this, though. Hash in particular
would adapt pretty trivially ...

regards, tom lane

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#10)
hackersgeneral
Re: No heap lookups on index

Simon Riggs <simon@2ndquadrant.com> writes:

We only need to index the row with the lowest value on any page so the main
index would get 100 times smaller. The main part of the index would not
need to be written to except when a block overflows.

BTW, the above is equivalent to saying that the leaf-level index pages
aren't there: the downlink pointers on the level-1 index pages are
pointers to heap pages, instead, and you're right that they effectively
only index the lowest value per page (actually IIRC the highest value
per page, but same difference).

I think the "100x" figure is overoptimistic though. There will be a lot
fewer entries per leaf page because actual heap tuples will be a lot
larger than index entries (typically at least). Hence, you need more
level-1 entries and so the upper index levels are bigger than in a
simple index. Another point is that the heap will be somewhat bloated
compared to a simple heap because of containing more unused space.
The traditional rule-of-thumb is that a btree index is only about 2/3rds
full at steady state, and I suppose this would apply to a
btree-organized heap too.

Still, it seems like an idea worth investigating.

Merge joins with the same index become block-level joins without sorts.
We would just do an individual block sort before merging, so no need for
very large sort-merges. Even if the block level indexes differ, we only
need to sort one of the tables.

I'd phrase that a little differently: an indexscan on such an index
would normally deliver unordered output, but you could demand ordered
output and get it by doing successive one-page sorts. I doubt it's
worth inventing a new layer of mergejoin code to do this rather than
keeping it at the index access level.

Come to think of it, the idea also seems to map nicely into bitmap index
scans: the index will directly hand back a list of potential pages to
look at, but they are all marked "lossy" because the index doesn't know
exactly which tuple(s) on the target pages match the query. The
existing bitmap-heap-scan code can take it from there.

regards, tom lane

#13Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Jim Nasby (#5)
hackersgeneral
Re: No heap lookups on index

Oracle does, but you pay in other ways. Instead of keeping dead tuples
in the main heap, they shuffle them off to an 'undo log'. This has some
downsides:

Rollbacks take *forever*, though this usually isn't much of an issue
unless you need to abort a really big transaction.

It's a good point though. Surely a database should be optimised for the
most common operation - commits, rather than rollbacks?

Chris

#14Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Christopher Kings-Lynne (#13)
hackersgeneral
Re: No heap lookups on index

On Thu, Jan 19, 2006 at 09:18:55AM +0800, Christopher Kings-Lynne wrote:

Oracle does, but you pay in other ways. Instead of keeping dead tuples
in the main heap, they shuffle them off to an 'undo log'. This has some
downsides:

Rollbacks take *forever*, though this usually isn't much of an issue
unless you need to abort a really big transaction.

It's a good point though. Surely a database should be optimised for the
most common operation - commits, rather than rollbacks?

Generally true, but keep in mind this counter-argument... our MVCC
performs fewer disk writes (since generally you can find some free space
on the page you're modifying) and you can control when you take the hit
of cleaning up dead space. In fact, you can take that hit at a reduced
priority (vacuum_cost_*).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Kings-Lynne (#13)
hackersgeneral
Re: No heap lookups on index

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

Oracle does, but you pay in other ways. Instead of keeping dead tuples
in the main heap, they shuffle them off to an 'undo log'. This has some
downsides:
Rollbacks take *forever*, though this usually isn't much of an issue
unless you need to abort a really big transaction.

It's a good point though. Surely a database should be optimised for the
most common operation - commits, rather than rollbacks?

The "shuffling off" of the data is expensive in itself, so I'm not sure
you can argue that the Oracle way is more optimal for commits either.

regards, tom lane

#16Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#12)
hackersgeneral
Re: No heap lookups on index

On Wed, Jan 18, 2006 at 08:13:59PM -0500, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

We only need to index the row with the lowest value on any page so the main
index would get 100 times smaller. The main part of the index would not
need to be written to except when a block overflows.

BTW, the above is equivalent to saying that the leaf-level index pages
aren't there: the downlink pointers on the level-1 index pages are
pointers to heap pages, instead, and you're right that they effectively
only index the lowest value per page (actually IIRC the highest value
per page, but same difference).

Would this open the door for allowing tables to be maintained in CLUSTER
order (at least at the block level if not within the blocks)? Though I
have no idea how you'd handle page splits without a lot of pain, but
perhaps it would be possible to strive for a certain tuple ordering that
would allow for a periodic re-cluster that doesn't have to move a lot of
data. One thought is to strive for the same amount of free space on each
page, so if you're touching a tuple on a page that has less than desired
free space you move it's new version to either the next or previous
page.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#16)
hackersgeneral
Re: No heap lookups on index

"Jim C. Nasby" <jnasby@pervasive.com> writes:

Would this open the door for allowing tables to be maintained in CLUSTER
order (at least at the block level if not within the blocks)? Though I
have no idea how you'd handle page splits without a lot of pain

I think the way you'd attack that is by building the table with a pretty
low fill factor, so that there's room on each page for a number of
updates before you have to split. Since the index AM is going to be
dictating space allocation, this is all in its hands.

The existing CLUSTER code would probably be totally inapplicable to
this sort of organization --- we'd have to provide some alternate code
path for index-organized heaps.

regards, tom lane

#18Rod Taylor
rbt@rbt.ca
In reply to: Christopher Kings-Lynne (#13)
hackersgeneral
Re: No heap lookups on index

On Thu, 2006-01-19 at 09:18 +0800, Christopher Kings-Lynne wrote:

Oracle does, but you pay in other ways. Instead of keeping dead tuples
in the main heap, they shuffle them off to an 'undo log'. This has some
downsides:

Rollbacks take *forever*, though this usually isn't much of an issue
unless you need to abort a really big transaction.

It's a good point though. Surely a database should be optimised for the
most common operation

Yes.

- commits, rather than rollbacks?

Commits are most common because most databases are optimized for them.
Lots of programs go through a ton pre-checking to avoid a rollback that
they don't need to do under PostgreSQL.

I've found that for small systems I tend to rely very heavily on
frequent vacuums and database level exceptions for virtually all data
checking. Rollbacks are nearly as common as commits in those
environments if not more-so.
--

#19Bruce Momjian
bruce@momjian.us
In reply to: Glen Parker (#9)
hackersgeneral
Re: [HACKERS] No heap lookups on index

Glen Parker wrote:

Tom Lane wrote:

What ever happened to grouped heap reads, i.e. building a list of tuples
from the index, sorting in heap order, then reading the heap in a batch?

Done in 8.1. I'm uncertain whether Scott knows about that ...

That's GREAT news! Is that the "Bitmap Scan" item in the what's new
list (http://www.postgresql.org/docs/whatsnew)? I didn't even notice it

Yes.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#20Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#15)
hackersgeneral
Re: No heap lookups on index

Tom Lane <tgl@sss.pgh.pa.us> writes:

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

Oracle does, but you pay in other ways. Instead of keeping dead tuples
in the main heap, they shuffle them off to an 'undo log'. This has some
downsides:
Rollbacks take *forever*, though this usually isn't much of an issue
unless you need to abort a really big transaction.

It's a good point though. Surely a database should be optimised for the
most common operation - commits, rather than rollbacks?

The "shuffling off" of the data is expensive in itself, so I'm not sure
you can argue that the Oracle way is more optimal for commits either.

You pay in Oracle when you read these records too. If there are pending
updates you have to do a second read to the rollback segment to get the old
record. This hits long-running batch queries especially hard since by the time
they finish a large number of the records they're reading could have been
updated and require a second read to the rollback segments.

You also pay if the new value is too big to fit in the same space as the old
record. Then you get to have to follow a pointer to the new location. Oracle
tries to minimize that by intentionally leaving extra free space but that has
costs too.

And lastly rollback segments are of limited size. No matter how big you make
them there's always the risk that a long running query will take long enough
that data it needs will have expired from the rollback segments.

Oh, and note that optimizing for the common case has limits. Rollbacks may be
rare but one of the cases where they are effectively happening is on recovery
after a crash. And that's one process you *really* don't want to take longer
than necessary...

--
greg

#21Bruce Momjian
bruce@momjian.us
In reply to: David Scott (#1)
hackersgeneral
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#20)
hackersgeneral
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#21)
hackersgeneral
#24Jonah H. Harris
jonah.harris@gmail.com
In reply to: Tom Lane (#23)
hackersgeneral
#25Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruce Momjian (#19)
hackersgeneral
#26Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruce Momjian (#20)
hackersgeneral
#27Bruce Momjian
bruce@momjian.us
In reply to: Jonah H. Harris (#24)
hackersgeneral
#28Jonah H. Harris
jonah.harris@gmail.com
In reply to: Bruce Momjian (#27)
hackersgeneral
#29Josh Berkus
josh@agliodbs.com
In reply to: Jonah H. Harris (#28)
hackersgeneral
#30David Scott
davids@apptechsys.com
In reply to: Jonah H. Harris (#28)
hackersgeneral
#31David Scott
davids@apptechsys.com
In reply to: Tom Lane (#2)
hackersgeneral
#32Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Josh Berkus (#29)
hackersgeneral
#33Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: David Scott (#31)
hackersgeneral
#34Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#12)
hackersgeneral
#35Jeremy Drake
pgsql@jdrake.com
In reply to: Jim Nasby (#33)
hackersgeneral
#36Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jeremy Drake (#35)
hackersgeneral
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#36)
hackersgeneral
#38Jeremy Drake
pgsql@jdrake.com
In reply to: Jim Nasby (#36)
hackersgeneral