HOT for PostgreSQL 8.3

Started by Simon Riggsabout 19 years ago56 messageshackers
Jump to latest
#1Simon Riggs
simon@2ndQuadrant.com

Heap Only Tuples ("HOT") is a simplification of earlier proposals for
improving the way the server handles frequent updates, based upon what's
been learned and feedback received.

Heap Only Tuples
----------------

The basic idea is that when a tuple is UPDATEd we can, in certain
circumstances, avoid inserting index tuples for a tuple. Such tuples are
marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
other tuples. The pre-conditions for allowing a HOT UPDATE are
- UPDATE doesn't change any indexed columns
- there is space on the same block as the tuple being updated

There is no restriction on tuple length changes, nor any requirement for
an additional field in the tuple header; as a result this change does
not require activation by an additional WITH parameter and this
technique can be used on *all* tables.

HOT will, in some cases, perform better in conjunction with the use of
the fillfactor storage parameter. For smaller tables, this will seldom
be required, so database tuning will not increase in complexity (in
comparison with carefully planned VACUUM strategies in earlier
releases). In many cases, the update rate will cause a steady state to
be reached, with on-block space being reused cyclically.

At the same time we insert the HEAP_ONLY_TUPLE, the just-updated tuple
will be marked HEAP_UPDATE_ROOT. When we use an index to locate a heap
tuple, we start from this root tuple and hop forwards using the ctid
chain until we find the appropriate tuple.

CREATE INDEX requires some careful work to allow it to identify and
correctly index HEAP_ONLY_TUPLEs that need to become ~HEAP_ONLY_TUPLE as
a result of the new index. This will cause additional work to be
required for those cases. CREATE INDEX on a newly loaded table will be
completely unaffected. There is some complexity there, though we don't
go into detail on those issues here. Please read on!

To allow HOT to work effectively we need to consider how we will VACUUM,
noting that in many cases we can remove HEAP_ONLY_TUPLEs much more
easily because they have no index tuples referencing them. There are
various options at this stage, but for clarity only one of those options
is presented here.

When we try to UPDATE a tuple and the new tuple version doesn't fit on
the block, we get the BufferCleanupLock if possible and then perform a
single-block VACUUM. Any tuple that is both HEAP_DEAD & HEAP_ONLY_TUPLE
can be removed completely. This is possible by changing the t_ctid field
so that it points at the first visible-to-someone tuple in the chain, so
it points "over" the previous HOT tuples. The root tuple is also dead -
it cannot be removed completely, so it is replaced it with "just a
TupleHeader", which is referred to as a TupleStub. (Credit to Itagaki
for this concept).

e.g.

t1 (t_ctid: t2 ) - info HEAP_UPDATE_ROOT status HEAPTUPLE_DEAD
t2 (t_ctid: t3 ) - info HEAP_ONLY status HEAPTUPLE_DEAD
t3 (t_ctid:self) - info HEAP_ONLY status HEAPTUPLE_LIVE

after single-page VACUUM

t1 (t_ctid: t3 ) - info HEAP_UPDATE_ROOT & HEAP_TUPLE_STUB
- status HEAPTUPLE_RECENTLY_DEAD
- t1 is now a TupleStub only
t3 (t_ctid:self) - info HEAP_ONLY status HEAPTUPLE_LIVE

Status shown is the return value from HeapTupleSatisfiesVacuum()

The single-block VACUUM would alter *all* tuple chains on the block, not
just the one for the current tuple being UPDATEd.

This technique means that a tuple never changes its CTID, so everything
that currently uses CTID can continue normally. SeqScan would also work
identically to the way it works today.

It also means that we can't easily remove the root tuple, even if it is
now just a TupleStub (unless the whole chain is also removable because
of DELETE). Removing the root tuple will require a VACUUM *FULL*. Even
so, this option is still space-neutral in the worst-case, in comparison
with inserting index tuples.

When we perform the single-block VACUUM we don't change the FSM, nor do
we try to check/increment the table's freezelimit. HOT would alter
slightly the way that UPDATEs are signalled to stats, so that these HOT
UPDATEs don't count towards the threshold for autovacuuming - so that a
frequently HOT-updated table may only very seldom require a normal
VACUUM.

The number of itempointers would increase in many cases, though this
would still be limited by current maximums.

Various tweaks on this basic idea exist, which can be considered in more
detail if the basic concept is accepted.

- - -

This design is aimed at being a no-frills version of the code that has
already been written. The existing version is available for testing now
and will be made available on community.enterprisedb.com shortly.

Four PostgreSQL developers have various amounts of time to contribute to
developing the above solution and customising it further according to
the wishes of the Community. That is myself, Heikki Linnakangas, Pavan
Deolasee and Nikhil Sontakke. Taken together, it seems possible to craft
something that can be acceptable for PostgreSQL 8.3

Plan from here is to publish the WIP patch(es) weekly until 8.3 code
freeze, together with any performance results.

Your comments are welcome.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#1)
Re: HOT for PostgreSQL 8.3

"Simon Riggs" <simon@2ndquadrant.com> writes:

The basic idea is that when a tuple is UPDATEd we can, in certain
circumstances, avoid inserting index tuples for a tuple. Such tuples are
marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
other tuples.

What is VACUUM FULL going to do when it wants to move one of these things?

CREATE INDEX requires some careful work to allow it to identify and
correctly index HEAP_ONLY_TUPLEs that need to become ~HEAP_ONLY_TUPLE as
a result of the new index.

I think you've glossed over the CREATE INDEX problem much too easily.
The difficulty with that is going to be that de-HOT-ifying a tuple
is going to require multiple updates that can't possibly be put into
a single WAL record, and I don't think that WAL replay can clean up
after an incomplete update (since it can't run user-defined functions
and hence cannot be expected to compute index entries for itself).
So I don't think you can do that while preserving crash safety.

Removing the root tuple will require a VACUUM *FULL*.

That seems unacceptable ... it won't take too long for your table to
fill up with stubs, and we don't want to return to the bad old days
when periodic VACUUM FULL was unavoidable.

ISTM we could fix that by extending the index VACUUM interface to
include two concepts: aside from "remove these TIDs when you find them",
there could be "replace these TIDs with those TIDs when you find them".
This would allow pointer-swinging to one of the child tuples, after
which the old root could be removed. This has got the same atomicity
problem as for CREATE INDEX, because it's the same thing: you're
de-HOT-ifying the child. So if you can solve the former, I think you
can make this work too.

regards, tom lane

#3Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#2)
Re: HOT for PostgreSQL 8.3

On Wed, 2007-02-07 at 14:17 -0500, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

The basic idea is that when a tuple is UPDATEd we can, in certain
circumstances, avoid inserting index tuples for a tuple. Such tuples are
marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
other tuples.

What is VACUUM FULL going to do when it wants to move one of these things?

This question stands out from the others. I'm not sure which aspect
you're thinking of - do you see some failure cases?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#4Joshua D. Drake
jd@commandprompt.com
In reply to: Simon Riggs (#1)
Re: HOT for PostgreSQL 8.3

Simon Riggs wrote:

Heap Only Tuples ("HOT") is a simplification of earlier proposals for
improving the way the server handles frequent updates, based upon what's
been learned and feedback received.

Heap Only Tuples
----------------

The basic idea is that when a tuple is UPDATEd we can, in certain
circumstances, avoid inserting index tuples for a tuple. Such tuples are
marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
other tuples. The pre-conditions for allowing a HOT UPDATE are
- UPDATE doesn't change any indexed columns

Uhmmm... how often is that the case? Don't get me wrong, bravo but that
seems a rather large limitation.

Considering it, this would certainly be a boon in web space where you
have things like Rails doing:

UPDATE foo SET first_name = 'Barney' WHERE id = 1;

That would qualify correct?

Sincerely,

Joshua D. Drake

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

#5Merlin Moncure
mmoncure@gmail.com
In reply to: Joshua D. Drake (#4)
Re: HOT for PostgreSQL 8.3

On 2/7/07, Joshua D. Drake <jd@commandprompt.com> wrote:

Simon Riggs wrote:

Heap Only Tuples ("HOT") is a simplification of earlier proposals for
improving the way the server handles frequent updates, based upon what's
been learned and feedback received.

Uhmmm... how often is that the case? Don't get me wrong, bravo but that
seems a rather large limitation.

Considering it, this would certainly be a boon in web space where you
have things like Rails doing:

HOT is great for tables that are updated frequently via triggers or
cron right? so it this eliminate the need to vacuum foo following
executing:
update foo set v =1 where id =1;
where v is not an index, right?

if so, it would be great all kinds of things, especially
materialization techniques. or any situation where a table is updated
so frequently autovac can't keep up. I can think of tons of places
where this would be useful.

merlin

#6Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#2)
Re: HOT for PostgreSQL 8.3

Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

The basic idea is that when a tuple is UPDATEd we can, in certain
circumstances, avoid inserting index tuples for a tuple. Such tuples are
marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
other tuples.

What is VACUUM FULL going to do when it wants to move one of these things?

I suppose it could move the whole chain to the same target page, if
there is one with enough space to accommodate the whole chain. Or we
could just cop out and not move tuples marked with HEAP_ONLY_TUPLE. I
think that would be acceptable; after the last transaction that can see
the old tuple is finished, the old tuple is dead. After that, VACUUM
FULL can remove the old tuple and move the remaining tuple as usual.

CREATE INDEX requires some careful work to allow it to identify and
correctly index HEAP_ONLY_TUPLEs that need to become ~HEAP_ONLY_TUPLE as
a result of the new index.

I think you've glossed over the CREATE INDEX problem much too easily.
The difficulty with that is going to be that de-HOT-ifying a tuple
is going to require multiple updates that can't possibly be put into
a single WAL record, and I don't think that WAL replay can clean up
after an incomplete update (since it can't run user-defined functions
and hence cannot be expected to compute index entries for itself).
So I don't think you can do that while preserving crash safety.

Yeah, chilling tuples from HOT state to normal tuples is not easy.

One solution I thought of is to add another flag to heap tuples,
CHILL_IN_PROGRESS. To chill a tuple, you would:

1. Mark heap tuple with CHILL_IN_PROGRESS
2. Insert missing index entries
3. Clear CHILL_IN_PROGRESS and HEAP_ONLY_TUPLE flags

Index scans would ignore tuples with CHILL_IN_PROGRESS and directly
pointed to from the index. That way if we crash in the middle of step 2,
scans and updates would work normally after replay, as if the index
entries weren't there. CREATE INDEX would have to fail if there's any
CHILL_IN_PROGRESS tuples, because we wouldn't know which index entries
need to be inserted; some might already be there. To clear the
CHILL_IN_PROGRESS flag, a vacuum would be needed. Vacuum would remove
all index entries for those tuples, but instead of removing the heap
tuple in the 2nd scan it would just clear the CHILL_IN_PROGRESS flag,
bringing us back to where we started.

However, the easiest solution would be to make CREATE INDEX wait until
the old tuple is dead. That should be ok at least for concurrent CREATE
INDEX, because it already has that kind of a wait between 1st and 2nd
phase.

Removing the root tuple will require a VACUUM *FULL*.

That seems unacceptable ... it won't take too long for your table to
fill up with stubs, and we don't want to return to the bad old days
when periodic VACUUM FULL was unavoidable.

Agreed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#2)
Re: HOT for PostgreSQL 8.3

On Wed, 2007-02-07 at 14:17 -0500, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

The basic idea is that when a tuple is UPDATEd we can, in certain
circumstances, avoid inserting index tuples for a tuple. Such tuples are
marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
other tuples.

What is VACUUM FULL going to do when it wants to move one of these things?

In addition to others suggested, one option is to rework VACUUM FULL:

Use case for VACUUM FULL is very low these days. It has an appallingly
long execution time. This can be speeded up by dropping and re-creating
indexes, says the manual, but it is still lengthy. It is even faster to
drop the indexes, do a CREATE TABLE AS SELECT * FROM table, drop the old
table and then rebuild the indexes. When moving into the new relation
the space requirement is not double, since we only copy useful
data/space and we also do this using the space the indexes occupied, so
the actual space overhead isn't that high. VACUUM FULL also generates
lots of WAL and ties up lots of memory while it operates - and it uses
memory without any constraint. The CTAS technique doesn't carry across
all of the visible tuple chains, but then to be brutally frank, by the
time VACUUM FULL has actually finished executing, it is very frequently
the oldest transaction anyway, so we needn't really have gone to all the
trouble of moving the tuple chains.

So the main use case for VACUUM FULL is when the space to be freed
inside the table is low enough to make defraging the table quicker than
a CTAS, yet still high enough that we were worried enough to do a VACUUM
FULL. Thats a very narrow use case, and if it exists at all there's a
narrow time window associated with it - only a heavily updated/deleted
table needs vacuuming anyway - that means most often be performing the
VF when we're already into the zone where the CTAS approach is quicker.
VACUUM FULL also forces us to handle various failure cases that leave
half-moved tuples scattered across tables. (Incomplete VACUUM FULLs are
actually fairly common because of its incredibly long run time).

So the radical suggestion is to continue to support the VACUUM FULL
command, but using a newly coded technique that is both higher
performance and more robust. We can either make VACUUM FULL wait until
it actually is the oldest transaction, or we can mark the table in some
way so that a lock cannot be obtained upon it by an earlier Xid. We
should be able to compact the files of a large table one physical file
at a time, so the space overhead is only ever MAX_PHYSICAL_FILESIZE and
the space overhead may become a net space gain as the VF continues. This
is of course (almost) identical to the approach already in use for
CLUSTER, so it seems like that should be acceptable. As a result, its
really not that much code and can still be accomplished on time.

Note also that we would not have to drop and re-add Foreign Keys, since
nothing can have changed while we have the table locked.

Doing this also frees up two heap info bits and simplifies many of the
HeapTupleSatisfies code, which could probably use the helping hand.
Tuples moved to the new files would retain their info bit settings as
they are copied across.

So overall, it seems a lot easier to completely replace VF than to fight
through its complexities and failure cases. If we do the above, then
we'll speed up VACUUM FULL and we'll be able to handle HOT tuples
easily.

CREATE INDEX requires some careful work to allow it to identify and
correctly index HEAP_ONLY_TUPLEs that need to become ~HEAP_ONLY_TUPLE as
a result of the new index.

I think you've glossed over the CREATE INDEX problem much too easily.
The difficulty with that is going to be that de-HOT-ifying a tuple
is going to require multiple updates that can't possibly be put into
a single WAL record, and I don't think that WAL replay can clean up
after an incomplete update (since it can't run user-defined functions
and hence cannot be expected to compute index entries for itself).
So I don't think you can do that while preserving crash safety.

No intention to gloss, just wanted to get past first post and onto the
really complex stuff. You're absolutely right that this is where much of
the thinking/work needs to take place.

Removing the root tuple will require a VACUUM *FULL*.

That seems unacceptable ... it won't take too long for your table to
fill up with stubs, and we don't want to return to the bad old days
when periodic VACUUM FULL was unavoidable.

Completely agree. I wanted to start right at the very beginning, so
everybody would understand the issues, rather than jump straight in
again with additional complexity.

ISTM we could fix that by extending the index VACUUM interface to
include two concepts: aside from "remove these TIDs when you find them",
there could be "replace these TIDs with those TIDs when you find them".
This would allow pointer-swinging to one of the child tuples, after
which the old root could be removed. This has got the same atomicity
problem as for CREATE INDEX, because it's the same thing: you're
de-HOT-ifying the child. So if you can solve the former, I think you
can make this work too.

This is looking like the best option out of the many, since it doesn't
have any serious restrictions or penalties. Let's see what Pavan thinks,
since he's been working on this aspect.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#8Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#6)
Re: HOT for PostgreSQL 8.3

On Thu, 2007-02-08 at 14:47 +0000, Heikki Linnakangas wrote:

However, the easiest solution would be to make CREATE INDEX wait until
the old tuple is dead. That should be ok at least for concurrent CREATE
INDEX, because it already has that kind of a wait between 1st and 2nd
phase.

I'm not sure this part of the idea is possible; the rest sounded good.
Looking at DefineIndex() the wait happens only for transactions that
already have a lock on the table being indexed, which may not be very
many. AFAICS the ref page for CREATE INDEX CONCURRENTLY isn't fully
accurate (any more?) when it says "[CREATE INDEX] must wait for all
existing transactions to terminate".

Waiting until an arbitrary Xid dies could be deadlock-prone, if the lock
isn't taken carefully. Imagine a pg_dump coming towards you and then
waiting on the locked table. You'd need to wait at the beginning of the
command, before locks were taken. However, that would means CREATE INDEX
would only be possible outside of transaction blocks, which I don't
think is acceptable.

I wanted this for VACUUM FULL also, but same problem exists.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#9Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Simon Riggs (#7)
Re: HOT for PostgreSQL 8.3

On 2/9/07, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, 2007-02-07 at 14:17 -0500, Tom Lane wrote:

ISTM we could fix that by extending the index VACUUM interface to
include two concepts: aside from "remove these TIDs when you find them",
there could be "replace these TIDs with those TIDs when you find them".
This would allow pointer-swinging to one of the child tuples, after
which the old root could be removed. This has got the same atomicity
problem as for CREATE INDEX, because it's the same thing: you're
de-HOT-ifying the child. So if you can solve the former, I think you
can make this work too.

This is looking like the best option out of the many, since it doesn't
have any serious restrictions or penalties. Let's see what Pavan thinks,
since he's been working on this aspect.

ISTM that there two related issues that we need to solve to make
progress.

- We must make de-HOTifying or CHILLing crash safe
- Concurrent index scans should work correctly with CHILLing operations

I think the first issue can be addressed on the lines of what Heikki
suggested.
We can CHILL one tuple at a time. I am thinking of a two step process.
In the first step, the root-tuple and the heap-only tuple (which needs
CHILLing)
are marked with a special flag, CHILL_IN_PROGRESS. This operation is
WAL logged. We then insert appropriate index entries for the tuple under
consideration.

In the second step, the HEAP_UPDATED_ROOT and HEAP_ONLY_TUPLE
flags on the heap tuples are adjusted and CHILL_IN_PROGRESS flags are
cleared.

During normal operations, if CHILL_IN_PROGRESS flag is found set, we might
need to do some more work to figure out whether the index insert operations
were successful or not. If we find that there are missing index entries for
the tuple
under consideration for CHILLing, then those could be added now and flags
are set/reset appropriately.

The second problem of concurrent index scans seems a bit more complex.
We need a mechanism so that no tuples are missed or tuples are
not returned twice. Since CHILLing of a tuple adds a new access path to the
tuple from the index, a concurrent index scan may return a tuple twice.

How about grabbing a AccessExclusiveLock during CHILLing
operation ? This would prevent any concurrent index scans. Since CHILLing
of a large table can take a long time, the operation can be spread across
time with periodic acquire/release of the lock. This would prevent
starvation
of other backends. Since CHILLing is required only for CREATE INDEX
and stub-cleanup, I am assuming that its ok for it to be lazy in nature.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com

#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#2)
Re: HOT for PostgreSQL 8.3

Tom Lane wrote:

ISTM we could fix that by extending the index VACUUM interface to
include two concepts: aside from "remove these TIDs when you find them",
there could be "replace these TIDs with those TIDs when you find them".
This would allow pointer-swinging to one of the child tuples, after
which the old root could be removed.

Implementing the "replace these TIDs" operation atomically would be
simple, except for the new bitmap index am. It should be possible there
as well, but if the old and new tid happen to be on a different bitmap
page, it requires some care to avoid deadlocks.

Also, we'd need more work mem for vacuum.

This has got the same atomicity
problem as for CREATE INDEX, because it's the same thing: you're
de-HOT-ifying the child.

Not exactly. De-HOT-ifying, or chilling, a child means inserting new
index entries. But if we're just replacing the tids from the existing
index entries, it's ok if we crash after replacing some but not all of
them. The next vacuum would replace the rest of the pointers, and remove
the old root tuple.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11Teodor Sigaev
teodor@sigaev.ru
In reply to: Heikki Linnakangas (#10)
Re: HOT for PostgreSQL 8.3

Implementing the "replace these TIDs" operation atomically would be
simple, except for the new bitmap index am. It should be possible there

That isn't simple (may be, even possible) from GIN.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Teodor Sigaev (#11)
Re: HOT for PostgreSQL 8.3

Teodor Sigaev <teodor@sigaev.ru> writes:

Implementing the "replace these TIDs" operation atomically would be
simple, except for the new bitmap index am. It should be possible there

That isn't simple (may be, even possible) from GIN.

I suspect that those pushing this idea only care about btrees anyway,
so one possible answer is that HOT is only possible when the table has
only btree indexes --- or at least, only indexes of AMs that support the
replace-these-TIDs operation. (Yet another pg_am flag...)

regards, tom lane

#13Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#12)
Re: HOT for PostgreSQL 8.3

On Fri, 2007-02-09 at 10:17 -0500, Tom Lane wrote:

Teodor Sigaev <teodor@sigaev.ru> writes:

Implementing the "replace these TIDs" operation atomically would be
simple, except for the new bitmap index am. It should be possible there

That isn't simple (may be, even possible) from GIN.

I suspect that those pushing this idea only care about btrees anyway,
so one possible answer is that HOT is only possible when the table has
only btree indexes --- or at least, only indexes of AMs that support the
replace-these-TIDs operation. (Yet another pg_am flag...)

Well, thats me. Yes, I think b-trees-only is acceptable.

Realistically, very frequent updating and full text indexing are easily
separable use cases, at least into separate tables.

HOT should be of use in Data Warehousing applications also, when summary
tables are maintained alongside detailed data, but that also sounds like
HOT and bitmap indexes would be separable at the table level without
difficulty.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#14Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#10)
Re: HOT for PostgreSQL 8.3

On Fri, 2007-02-09 at 13:39 +0000, Heikki Linnakangas wrote:

Tom Lane wrote:

ISTM we could fix that by extending the index VACUUM interface to
include two concepts: aside from "remove these TIDs when you find them",
there could be "replace these TIDs with those TIDs when you find them".
This would allow pointer-swinging to one of the child tuples, after
which the old root could be removed.

Implementing the "replace these TIDs" operation atomically would be
simple, except for the new bitmap index am. It should be possible there
as well, but if the old and new tid happen to be on a different bitmap
page, it requires some care to avoid deadlocks.

Grouped Item Indexes cope with this easily also, yes?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#15Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#14)
Re: HOT for PostgreSQL 8.3

Simon Riggs wrote:

On Fri, 2007-02-09 at 13:39 +0000, Heikki Linnakangas wrote:

Tom Lane wrote:

ISTM we could fix that by extending the index VACUUM interface to
include two concepts: aside from "remove these TIDs when you find them",
there could be "replace these TIDs with those TIDs when you find them".
This would allow pointer-swinging to one of the child tuples, after
which the old root could be removed.

Implementing the "replace these TIDs" operation atomically would be
simple, except for the new bitmap index am. It should be possible there
as well, but if the old and new tid happen to be on a different bitmap
page, it requires some care to avoid deadlocks.

Grouped Item Indexes cope with this easily also, yes?

Yes, as long as the old and the new tid point to the same page.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#16Hannu Krosing
hannu@tm.ee
In reply to: Heikki Linnakangas (#10)
Re: HOT for PostgreSQL 8.3

Ühel kenal päeval, R, 2007-02-09 kell 13:39, kirjutas Heikki
Linnakangas:

Tom Lane wrote:

ISTM we could fix that by extending the index VACUUM interface to
include two concepts: aside from "remove these TIDs when you find them",
there could be "replace these TIDs with those TIDs when you find them".
This would allow pointer-swinging to one of the child tuples, after
which the old root could be removed.

Implementing the "replace these TIDs" operation atomically would be
simple, except for the new bitmap index am. It should be possible there
as well, but if the old and new tid happen to be on a different bitmap
page, it requires some care to avoid deadlocks.

Also, we'd need more work mem for vacuum.

Why do we need to muck around with indexes at all ?

What are the problems with just shuffling the last (and only visible)
tuple to replace the HOT-hain root and be done with it ?

Can there be some problems with seqscans moving one tuple at a time or
doing revisits of the "same" (by TID) tuple ? If there are can't these
be fixed by creative use of ctid chains form the original live one to
the new live one ?

This has got the same atomicity
problem as for CREATE INDEX, because it's the same thing: you're
de-HOT-ifying the child.

Not exactly. De-HOT-ifying, or chilling, a child means inserting new
index entries.

If we can just move the tuple inside the page we can avoid even that.

But if we're just replacing the tids from the existing
index entries, it's ok if we crash after replacing some but not all of
them. The next vacuum would replace the rest of the pointers, and remove
the old root tuple.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#16)
Re: HOT for PostgreSQL 8.3

Hannu Krosing <hannu@skype.net> writes:

What are the problems with just shuffling the last (and only visible)
tuple to replace the HOT-hain root and be done with it ?

ctid stops being a reliable identifier.

regards, tom lane

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#17)
Re: HOT for PostgreSQL 8.3

On Fri, 2007-02-09 at 13:47 -0500, Tom Lane wrote:

Hannu Krosing <hannu@skype.net> writes:

What are the problems with just shuffling the last (and only visible)
tuple to replace the HOT-hain root and be done with it ?

ctid stops being a reliable identifier.

Yes, that sums it up.

The issue can be overcome in the internals, but the only means of doing
so that seemed workable required changing the output of the CTID special
column. Some applications use the tid datatype directly to relocate rows
within a transaction using the tidscan. ODBC driver uses that concept to
implement Updateable Cursors from the client. We can change that, but
we'd break all the other ones we don't know about.

This issue was one of the major contributing factors to the size of the
previous HOT prototype, making it more invasive than is really
desirable.

Pointer-swinging avoids those issues and seems workable, even if it is a
pain to have to visit the index during VACUUM. So changing CTID isn't a
bridge we really need to cross, for which I'm glad.

Just as a matter of record, the tuple identifier would need to include
(block, itemid, xmin, cmin) to make this idea work, with the itemid
being that of the root tuple and the xmin and cmin being the tuple in
the chain that is being referenced. This would've then allowed callers
to relocate a specific tuple, even when the update chain had changed
between block accesses. Anyway, glad we're not going there anymore.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Pavan Deolasee (#9)
Re: HOT for PostgreSQL 8.3

On Fri, 2007-02-09 at 13:16 +0530, Pavan Deolasee wrote:

The second problem of concurrent index scans seems a bit more complex.
We need a mechanism so that no tuples are missed or tuples are
not returned twice. Since CHILLing of a tuple adds a new access path
to the tuple from the index, a concurrent index scan may return a
tuple twice.

How about grabbing a AccessExclusiveLock during CHILLing
operation ? This would prevent any concurrent index scans. Since
CHILLing of a large table can take a long time, the operation can be
spread across time with periodic acquire/release of the lock. This
would prevent starvation of other backends. Since CHILLing is required
only for CREATE INDEX and stub-cleanup, I am assuming that its ok for
it to be lazy in nature.

We've just spoken about this, so just wanted to add those thoughts here.

A pointer-swing operation will begin when we see a tuple that is both
status of HEAPTUPLE_DEAD and is marked HEAP_UPDATE_ROOT. Perhaps that
requires a new status from HeapTupleSatisfiesVacuum()? We chill all
tuples in the chain, up to the new root. We mark those tuples, so that
HeapTupleSatisfiesVacuum() will be describe them as
HEAPTUPLE_RECENTLY_DEAD. So the current Vacuum won't immediately remove
them, but they'll go away in the future as part of an on-demand block
vacuum or Vacuum. That's similar to the way we handle HALF_DEAD index
pages.

The index scans are page-at-a-time, so when we pointer-swing from the
root tuple to one of the HOT tuples we'll be OK. We're switching a
specific index tuple, so there's no multi-page locking on the index to
consider.

Right now, I wouldn't want to assume that the way tuples are marked
prior to pointer-swinging is exactly the same as the chilling required
by CREATE INDEX: CHILL_IN_PROGRESS. It may well be, but I'm wary that we
assume they are exactly the same and introduce a subtle bug.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#20Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#2)
Re: HOT for PostgreSQL 8.3

Tom Lane wrote:

Removing the root tuple will require a VACUUM *FULL*.

That seems unacceptable ... it won't take too long for your table to
fill up with stubs, and we don't want to return to the bad old days
when periodic VACUUM FULL was unavoidable.

ISTM we could fix that by extending the index VACUUM interface to
include two concepts: aside from "remove these TIDs when you find them",
there could be "replace these TIDs with those TIDs when you find them".
This would allow pointer-swinging to one of the child tuples, after
which the old root could be removed. This has got the same atomicity
problem as for CREATE INDEX, because it's the same thing: you're
de-HOT-ifying the child. So if you can solve the former, I think you
can make this work too.

I need clarification here. Is removing dead heap tuple always going to
require an index scan, or was this just for chilling a row (adding an
index)?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#21Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#20)
#22Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#21)
#23Hannu Krosing
hannu@tm.ee
In reply to: Simon Riggs (#1)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#23)
#25Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#24)
#26Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Hannu Krosing (#25)
#27Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Pavan Deolasee (#26)
#28Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Hannu Krosing (#25)
#29Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#28)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#25)
#31Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Simon Riggs (#18)
#32Mark Mielke
mark@mark.mielke.cc
In reply to: Tom Lane (#30)
#33Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Mark Mielke (#32)
#34Hannu Krosing
hannu@tm.ee
In reply to: Heikki Linnakangas (#33)
#35Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Hannu Krosing (#34)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#35)
#37Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Tom Lane (#36)
#38Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#36)
#39Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Hannu Krosing (#38)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#38)
#41Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#40)
#42Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Hannu Krosing (#41)
#43Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Hannu Krosing (#41)
#44Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Zeugswetter Andreas SB SD (#43)
#45Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Pavan Deolasee (#44)
#46Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Zeugswetter Andreas SB SD (#45)
#47Simon Riggs
simon@2ndQuadrant.com
In reply to: Pavan Deolasee (#29)
#48Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Simon Riggs (#47)
#49Simon Riggs
simon@2ndQuadrant.com
In reply to: Pavan Deolasee (#48)
#50Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Pavan Deolasee (#48)
#51Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Zeugswetter Andreas SB SD (#50)
#52Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Pavan Deolasee (#51)
#53Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Zeugswetter Andreas SB SD (#52)
#54Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Pavan Deolasee (#53)
#55Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Zeugswetter Andreas SB SD (#54)
#56Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Pavan Deolasee (#55)