Frequent Update Project: Design Overview of HOT Updates

Started by Simon Riggs, over 19 years ago, 50 messages, hackers
#1 Simon Riggs
simon@2ndQuadrant.com

Design Overview of HOT Updates
------------------------------

The objective is to increase the speed of the UPDATE case, while
minimizing the overall negative effects of the UPDATE. We refer to the
general requirement as *Frequent Update Optimization*, though this
design proposal is for Heap Overflow Tuple (HOT) Updates. It is similar
in some ways to the design for SITC already proposed, though it has a
number of additional features drawn from other designs to make it a
practical and effective implementation.

EnterpriseDB have a working, performant prototype of this design. There
are still a number of issues to resolve and the intention is to follow
an open community process to find the best way forward. All required
detail will be provided for the work conducted so far.

Current PGSQL behaviour is for UPDATEs to create a new tuple version
within the heap, so an UPDATE acts from many perspectives as if it were
an INSERT.
All of the tuple versions are chained together, so that whichever of the
tuples is visible to your Snapshot, you can walk the chain to find the
most recent tuple version to update.

The HOT proposal splits the heap into two areas: the main heap and an
overflow relation, both of which are regular, transactional relations.
INSERT and DELETE affect only the main heap exactly as they do now.
UPDATE places new tuple versions into the overflow relation in most
cases, maintaining all of the current capabilities of the MVCC model:
all tuple versions remain accessible, plus the chain of updated tuple
versions is maintained.

UPDATEs do not insert into indexes at all - no index pointers exist
into the overflow relation. So during a stream of UPDATEs the main
heap and indexes stay the same size, while the indexes are used
read-only, avoiding additional database writes and the need for eventual
index rebuilding.

The current tuple version within the main heap is referred to as the
"root tuple" from now on. When reading the main heap, if the root tuple
is not visible we "walk the chain" into the overflow relation until we
find a tuple version that is visible - we follow the t_ctid pointer from
tuple to tuple, testing for MVCC visibility as we go.

As more UPDATEs take place these tuple chains would grow, making
locating the latest tuple take progressively longer. Intuitively this
sounds like a bad design, though an additional feature turns that from
bad to good performance:

- when anybody walks the chain into the overflow relation, if we find
that the root tuple is vacuumable we copy back the first visible tuple
version over the now-dead root tuple.

This allows the length of a typical tuple chain to be extremely short in
practice. For a single connection issuing a stream of UPDATEs the chain
length will be no more than 1 at any time. Even under heavy concurrent
UPDATEs the modal chain length remains 1, with the chain length varying
in an approximately Poisson distribution.

The overflow relation is similar in concept to a TOAST table, so we
might describe this approach as TOAST-for-updated-versions. The code
implementation is similar also, with reasonably similar modularity. HOT
does sound radical, but no more so than TOAST was when first discussed.

We can locate any tuple version and thereby allow MVCC to work
correctly, while at the same time preserving the crucial property of the
Postgres non-overwriting storage manager. This works very well for
indexed tuple access since the index and main heap never grow, thus
maintaining access speeds. HOT supports both Read Committed and
Serializable transaction isolation levels, with transactions of any
length, just as with current MVCC.

Performance results against the previously described test cases are
approximately:

1. pgbench (TPC-B) - Around 200-300% better

2. DBT-2 (TPC-C) - Significant improvement - possibly 10%; somewhat
difficult to measure exactly because of the way the test itself works.

3. truckin - Around 500% improvement

The performance gain remains better than REL8_2 base even in the
presence of longer running transactions.

[There is also considerable synergy with changes to b-tree indexes that
are both useful in their own right, and even better when HOT is added,
more on that on a separate thread.]

We expect that people will want to measure this for themselves, so we
don't publish all of the detailed internal tests here since YMMV.

Using HOT
---------

HOT can only work in cases where an UPDATE does not modify one of the
columns defined in an index on the table, and when we do not alter the
row length of the tuple. [We'll be able to do that more efficiently when
we have plan invalidation]

If we perform an update that meets the HOT criteria then we put the
new version into the overflow relation; we describe this as a HOT
UPDATE. If we perform an update that does not meet the criteria, then we
carry on with the existing/old MVCC behaviour; we describe this as a
non-HOT UPDATE.

So with HOT we must support both HOT and non-HOT UPDATEs. Each tuple,
whether it is in the main heap or the overflow relation, can point to
another tuple with its t_ctid, which is extended by using a previously
unused bit to flag whether the destination is a block from the main heap
or the overflow relation.

Just to reiterate, since a HOT update doesn't touch index values that
means that all members of a tuple chain have identical primary keys and
index values. So we only ever need to grovel through the overflow
relation *if* we have already met the index search criteria - so we
don't have to re-check the index criteria for each tuple version we
touch. All normal updates of a table will be read-only on the index,
plus writes to heap/overflow.

Simple Example of Tuple Chaining
--------------------------------

Let's look at a simple case:

1. INSERT INTO table (1,....)
Tuple is inserted into main heap (Version1 or V1)

Heap Page P1 Overflow Relation
---------------- -----------------
| | | |
| ______ | | |
| | V1 | | | |
| |______| | | |
| | | |
|--------------| |---------------|

2. UPDATE table SET nonindexcol = 6 WHERE pkcol = 1;
Same length tuple, so V2 written using overflow

Heap Page P1 Overflow Relation
---------------- -----------------
| | | ______ |
| ______ | |------>| V2 | |
| | V1 |------| | |______| |
| |______| | | |
| | | |
|--------------| |---------------|

3. UPDATE table SET nonindexcol = 7 WHERE pkcol = 1;
Same length tuple, so V3 written using overflow
We update v2.t_ctid to maintain the forward tuple chain

Heap Page P1 Overflow Relation
---------------- -----------------
| | | ______ |
| ______ | |------>| V2 | |
| | V1 |------| | |______|--| |
| |______| | | ______ | |
| | | | V3 |<-| |
| | | |______| |
|--------------| |---------------|

Simple Example of Copying-back
------------------------------

4. UPDATE table SET nonindexcol = 8 WHERE pkcol = 1;
Same length tuple, so V4 will be written using overflow.
While selecting for UPDATE, we look at V1, we see it is now vacuumable,
so we first copy back the first visible tuple over V1 before proceeding.
[Assume V1 is dead, V2 is still visible]

Heap Page P1 Overflow Relation
---------------- -----------------
| | | ______ |
| ______ | | | V2* | |
| | V2 |------| | |______| |
| |______| | | | ______ |
| | |------>| V3 | |
| | | |______| |
|--------------| |---------------|

The copy back operation leaves V2* as the now-unwritable old copy of V2.
We update it also to break the chain to V3. Now orphaned, it can be
removed later.

5. Now we continue with the UPDATE as planned
Same length tuple, so V4 will be written using overflow.
We update v3.t_ctid to maintain the forward tuple chain

Heap Page P1 Overflow Relation
---------------- -----------------
| | | ______ |
| ______ | | | V2* | |
| | V2 |------| | |______| |
| |______| | | | ______ |
| | |------>| V3 | |
| | | |______|--| |
| | | ______ | |
| | | | V4 |<-| |
| | | |______| |
|--------------| |---------------|

Copying Back
------------

The copy back operation is required to keep the tuple chains short and
efficient. When we copy back we leave a tuple version and possibly a
long part of the tuple chain orphaned. These are then marked for removal
by the next VACUUM.

To support copying back, we add an additional header field onto
every tuple in the overflow relation (only). We generate WAL
appropriately for this action.

VACUUMing becomes significantly easier with HOT than with non-HOT
UPDATEs. Specifically, since there are no indexes on the overflow
relation we should be able to VACUUM it more easily and possibly
piece-by-piece (i.e. "retail vacuum").

Behavioural Characteristics
---------------------------

With HOT, it is easily possible that the chain of prior versions spans
many blocks. The chain always starts with the block of the root tuple
but possibly includes many overflow relation blocks.

A SeqScan of a HOT table will turn into a large number of
random accesses to the overflow relation, which could be considerably
more expensive than sequential I/O. Clearly very large tables would not
be able to be HOT enabled without additional cost, so we make HOT a
table-level option: WITH (freqUPDATE = on) [discuss...]

This means that a SeqScan of a heap with heavy concurrent update
activity *could* be fairly costly. Most times it won't be, because the
appropriate overflow relation blocks will most likely be in-memory. That
is the price we pay for having this optimization - a suitable warning
would apply, though this would not be a surprise to anybody, since Oracle
people know that reading heavily-updated data takes more CPU and
DB2/SQLServer people know that it takes ages to get around each of the
row-level locks. So this isn't bad *at all* in comparison.

We hope/expect/recommend HOT to be used in conjunction with UPDATE
RETURNING, so that fewer SELECTs and non-index operations will be used.

[There's an extra prize if you can think of a *correct* way of
SeqScanning with HOT more efficiently; we've not focused on that yet.]

In the presence of a long running transaction, the overflow relation
could become very large since VACUUM becomes less effective. (This same
situation occurs with current MVCC). The overflow relation will then
start to page out to disk and could grow large. Oracle has this problem
also and this is why Oracle 10g has moved away from fixed-size Rollback
Segments to the use of an overflow Tablespace. Some size controls are
available to limit the growth there. That would be a useful optimization
for HOT also and one much easier than any such enhancement of existing
MVCC.

Next Steps
----------

Deeper technical details will be published shortly; the design goes much
deeper than the discussion here, but we wanted to present the overall
summary first.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#2 Martijn van Oosterhout
kleptog@svana.org
In reply to: Simon Riggs (#1)
Re: Frequent Update Project: Design Overview of HOT Updates

Nice idea, just one question:

On Thu, Nov 09, 2006 at 05:13:16PM +0000, Simon Riggs wrote:

Behavioural Characteristics
---------------------------

With HOT, it is easily possible that the chain of prior versions spans
many blocks. The chain always starts with the block of the root tuple
but possibly includes many overflow relation blocks.

A SeqScan of a HOT table will turn into a large number of
random accesses to the overflow relation, which could be considerably
more expensive than sequential I/O. Clearly very large tables would not
be able to be HOT enabled without additional cost, so we make HOT a
table-level option: WITH (freqUPDATE = on) [discuss...]

It seems to me that bitmap index scans will get these same
characteristics also, right? The bitmap scan will have to follow the
chain of any possibly matching tuple in any of the blocks that are in
the bitmap, right?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#3 Simon Riggs
simon@2ndQuadrant.com
In reply to: Martijn van Oosterhout (#2)
Re: Frequent Update Project: Design Overview of HOT Updates

On Thu, 2006-11-09 at 18:49 +0100, Martijn van Oosterhout wrote:

Nice idea, just one question:

It seems to me that bitmap index scans will get these same
characteristics also, right? The bitmap scan will have to follow the
chain of any possibly matching tuple in any of the blocks that are in
the bitmap, right?

Yes, they would identify the root tuples. The whole chain has matching
index values, by definition, so the re-evaluation for lossy bitmaps will
work just the same before the actual tuple is retrieved by walking the
chain (if required).

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#4 Josh Berkus
josh@agliodbs.com
In reply to: Simon Riggs (#1)
Re: Frequent Update Project: Design Overview of HOT Updates

Simon,

If we perform an update that meets the HOT criteria then we put the
new version into the overflow relation; we describe this as a HOT
UPDATE. If we perform an update that does not meet the criteria, then we
carry on with the existing/old MVCC behaviour; we describe this as a
non-HOT UPDATE.

Making the essential performance analysis question, "Am I HOT or Not?"

Sorry, but someone had to say it. ;-)

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#4)
Re: Frequent Update Project: Design Overview of HOT Updates

On Thu, 2006-11-09 at 13:21 -0800, Josh Berkus wrote:

Simon,

If we perform an update that meets the HOT criteria then we put the
new version into the overflow relation; we describe this as a HOT
UPDATE. If we perform an update that does not meet the criteria, then we
carry on with the existing/old MVCC behaviour; we describe this as a
non-HOT UPDATE.

Making the essential performance analysis question, "Am I HOT or Not?"

Very good. ;-)

Well, we had Overflow Update CHaining as an alternative name... :-)

The naming sounds silly, but we had a few alternate designs, so we
needed to be able to tell them apart sensibly. We've had TVR, SITC, UIP
and now HOT. Software research...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#1)
Re: Frequent Update Project: Design Overview of HOT Updates

"Simon Riggs" <simon@2ndquadrant.com> writes:

As more UPDATEs take place these tuple chains would grow, making
locating the latest tuple take progressively longer.

This is the part that bothers me --- particularly the random-access
nature of the search. I wonder whether you couldn't do something
involving an initial table fill-factor of less than 50%, and having
the first updated version living on the same heap page as its parent.
Only when the active chain length was more than one (which you
hypothesize is rare) would it actually be necessary to do a random
access into the overflow table.

More generally, do we need an overflow table at all, rather than having
these overflow tuples living in the same file as the root tuples? As
long as there's a bit available to mark a tuple as being this special
not-separately-indexed type, you don't need a special location to know
what it is. This might break down in the presence of seqscans though.

Actually, you omitted to mention the locking aspects of moving tuples
around --- exactly how are you going to make that work without breaking
concurrent scans?

This allows the length of a typical tuple chain to be extremely short in
practice. For a single connection issuing a stream of UPDATEs the chain
length will be no more than 1 at any time.

Only if there are no other transactions being held open, which makes
this claim a lot weaker.

HOT can only work in cases where a tuple does not modify one of the
columns defined in an index on the table, and when we do not alter the
row length of the tuple.

Seems like "altering the row length" isn't the issue, it's just "is
there room on the page for the new version". Again, a generous
fillfactor would give you more flexibility.

[We'll be able to do that more efficiently when
we have plan invalidation]

Uh, what's that got to do with it?

regards, tom lane

#7 Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#6)
Re: Frequent Update Project: Design Overview of HOT Updates

Tom,

Actually, you omitted to mention the locking aspects of moving tuples
around --- exactly how are you going to make that work without breaking
concurrent scans?

I believe that's the "unsolved technical issue" in the prototype, unless
Pavan has solved it in the last two weeks. Pavan?

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#8 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Josh Berkus (#7)
Re: Frequent Update Project: Design Overview of HOT Updates

On 11/10/06, Josh Berkus <josh@agliodbs.com> wrote:

Tom,

Actually, you omitted to mention the locking aspects of moving tuples
around --- exactly how are you going to make that work without breaking
concurrent scans?

I believe that's the "unsolved technical issue" in the prototype, unless
Pavan has solved it in the last two weeks. Pavan?

When an overflow tuple is copied back to the main heap, the overflow tuple
is marked with a special flag (HEAP_OVERFLOW_MOVEDBACK). Subsequently,
when a backend tries to lock the overflow version of the tuple, it checks
the flag and jumps to the main heap if the flag is set.

We use the same technique to update the correct version of a tuple when a
tuple is moved back to the main heap.

Regards,
Pavan

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Pavan Deolasee (#8)
Re: Frequent Update Project: Design Overview of HOT Updates

"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:

On 11/10/06, Josh Berkus <josh@agliodbs.com> wrote:

I believe that's the "unsolved technical issue" in the prototype, unless
Pavan has solved it in the last two weeks. Pavan?

When an overflow tuple is copied back to the main heap, the overflow tuple
is marked with a special flag (HEAP_OVERFLOW_MOVEDBACK). Subsequently,
when a backend tries to lock the overflow version of the tuple, it checks
the flag and jumps to the main heap if the flag is set.

(1) How does it "jump to the main heap"? The links go the other
direction.

(2) Isn't this full of race conditions?

(3) I thought you already used up the one remaining t_infomask bit.

regards, tom lane

#10 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Tom Lane (#9)
Re: Frequent Update Project: Design Overview of HOT Updates

On 11/10/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:

On 11/10/06, Josh Berkus <josh@agliodbs.com> wrote:

I believe that's the "unsolved technical issue" in the prototype, unless
Pavan has solved it in the last two weeks. Pavan?

When an overflow tuple is copied back to the main heap, the overflow tuple
is marked with a special flag (HEAP_OVERFLOW_MOVEDBACK). Subsequently,
when a backend tries to lock the overflow version of the tuple, it checks
the flag and jumps to the main heap if the flag is set.

(1) How does it "jump to the main heap"? The links go the other
direction.

The overflow tuple has a special header to store the back pointer to the
main heap. This increases the tuple header size by 6 bytes, but the
overhead is restricted only to the overflow tuples.

(2) Isn't this full of race conditions?

I agree, there could be race conditions. But IMO we can handle those. When
we follow the tuple chain, we hold a SHARE lock on the main heap buffer.
Also, when the root tuple is vacuumable and needs to be overwritten, we
acquire and keep EXCLUSIVE lock on the main heap buffer.

This reduces the race conditions to a great extent.

(3) I thought you already used up the one remaining t_infomask bit.

Yes. The last bit in the t_infomask is used up to mark the presence of the
overflow tuple header. But I believe there are a few more bits that can be
reused. There are three bits available in the t_ctid field as well (since
ip_posid needs a maximum of 13 bits). One bit is used to identify whether
a given tid points to the main heap or the overflow heap. This helps when
tids are passed around in the code.

Since the back pointer from the overflow tuple always points to the main
heap, the same bit can be used to mark copied-back tuples (we are doing it
in a slightly different way in the current prototype though).

Regards,
Pavan

#11 Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#6)
Re: Frequent Update Project: Design Overview of HOT

On a fine day, Thu, 2006-11-09 at 18:28, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

As more UPDATEs take place these tuple chains would grow, making
locating the latest tuple take progressively longer.

This is the part that bothers me --- particularly the random-access
nature of the search. I wonder whether you couldn't do something
involving an initial table fill-factor of less than 50%, and having
the first updated version living on the same heap page as its parent.
Only when the active chain length was more than one (which you
hypothesize is rare) would it actually be necessary to do a random
access into the overflow table.

More generally, do we need an overflow table at all, rather than having
these overflow tuples living in the same file as the root tuples? As
long as there's a bit available to mark a tuple as being this special
not-separately-indexed type, you don't need a special location to know
what it is. This might break down in the presence of seqscans though.

And why do you need to mark it as not-separately-indexed at all ?

We already cope with missing index pointers in VACUUM and I can't see
any other reason to have it.

What are the advantages of HOT over SITC (other than the cool name)?

Maybe just make HOT an extended SITC which can span pages.

In case of HOT together with reusing index tuples with DELETED bit set
we don't actually need copyback, but the same index pointer will follow
the head of live data automatically, maybe lagging only a small number
of versions.

Actually, you omitted to mention the locking aspects of moving tuples
around --- exactly how are you going to make that work without breaking
concurrent scans?

This allows the length of a typical tuple chain to be extremely short in
practice. For a single connection issuing a stream of UPDATEs the chain
length will be no more than 1 at any time.

Only if there are no other transactions being held open, which makes
this claim a lot weaker.

HOT can only work in cases where a tuple does not modify one of the
columns defined in an index on the table, and when we do not alter the
row length of the tuple.

Seems like "altering the row length" isn't the issue, it's just "is
there room on the page for the new version". Again, a generous
fillfactor would give you more flexibility.

Maybe they hoped to take very light locks when the new chain head is
copied over the old one in the same-length case.

http://www.postgresql.org/about/donate

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

NOTICE: This communication contains privileged or other confidential
information. If you have received it in error, please advise the sender
by reply email and immediately delete the message and any attachments
without copying or disclosing the contents.

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Pavan Deolasee (#10)
Re: Frequent Update Project: Design Overview of HOT Updates

"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:

On 11/10/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
(2) Isn't this full of race conditions?

I agree, there could be race conditions. But IMO we can handle those.

Doubtless you can prevent races by introducing a bunch of additional
locking. The question was really directed to how much concurrent
performance is left, once you get done locking it down.

(3) I thought you already used up the one remaining t_infomask bit.

Yes. The last bit in the t_infomask is used up to mark the presence of the
overflow tuple header. But I believe there are a few more bits that can be
reused. There are three bits available in the t_ctid field as well (since
ip_posid needs a maximum of 13 bits).

No, you cannot have those bits --- BLCKSZ is not limited to 8K, and even
if it were, we will not hold still for sticking flag bits into an
unrelated datatype.

You can probably fix this by inventing multiple context-dependent
interpretations of t_infomask bits, but that strikes me as a mighty
ugly and fragile way to proceed.

(Actually, the assumption that you can throw an additional back-pointer
into overflow tuple headers is the worst feature of this proposal in
that regard --- it's really not that easy to support multiple header
formats.)

regards, tom lane

#13 Hannu Krosing
hannu@tm.ee
In reply to: Hannu Krosing (#11)
Re: Frequent Update Project: Design Overview of HOT

On a fine day, Fri, 2006-11-10 at 09:06, Hannu Krosing wrote:

On a fine day, Thu, 2006-11-09 at 18:28, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

As more UPDATEs take place these tuple chains would grow, making
locating the latest tuple take progressively longer.

This is the part that bothers me --- particularly the random-access
nature of the search. I wonder whether you couldn't do something
involving an initial table fill-factor of less than 50%, and having
the first updated version living on the same heap page as its parent.
Only when the active chain length was more than one (which you
hypothesize is rare) would it actually be necessary to do a random
access into the overflow table.

More generally, do we need an overflow table at all, rather than having
these overflow tuples living in the same file as the root tuples? As
long as there's a bit available to mark a tuple as being this special
not-separately-indexed type, you don't need a special location to know
what it is. This might break down in the presence of seqscans though.

And why do you need to mark it as not-separately-indexed at all ?

We already cope with missing index pointers in VACUUM and I can't see
any other reason to have it.

Ok, now I see it - we can't VACUUM a tuple if next versions of it are
accessible by the t_ctid chain only. That is, vacuum must not free tuples
which have t_ctid pointing to a tuple that has the not-separately-indexed
bit set. This seems to make vacuum quite complicated, as it has to
examine t_ctid chains to detect if it can free a tuple, and what's even
worse, it has to examine these chains backwards.

What are the advantages of HOT over SITC (other than cool name) ?

still wondering this, is it just the ability to span multiple pages?

Maybe just make HOT an extended SITC which can span pages.

In case of HOT together with reusing index tuples with DELETED bit set
we don't actually need copyback, but the same index pointer will follow
the head of live data automatically, maybe lagging only a small number
of versions.

Actually, you omitted to mention the locking aspects of moving tuples
around --- exactly how are you going to make that work without breaking
concurrent scans?

This allows the length of a typical tuple chain to be extremely short in
practice. For a single connection issuing a stream of UPDATEs the chain
length will be no more than 1 at any time.

Only if there are no other transactions being held open, which makes
this claim a lot weaker.

HOT can only work in cases where a tuple does not modify one of the
columns defined in an index on the table, and when we do not alter the
row length of the tuple.

Seems like "altering the row length" isn't the issue, it's just "is
there room on the page for the new version". Again, a generous
fillfactor would give you more flexibility.

Maybe they hoped to take very light locks when the new chain head is
copied over the old one in the same-length case.

http://www.postgresql.org/about/donate

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#14 Simon Riggs
simon@2ndQuadrant.com
In reply to: Hannu Krosing (#13)
Re: Frequent Update Project: Design Overview of HOT

On Fri, 2006-11-10 at 09:51 +0200, Hannu Krosing wrote:

What are the advantages of HOT over SITC (other than cool name) ?

still wondering this, is it just the ability to span multiple pages?

Multiple page spanning, copy back/VACUUM support, separate overflow
relation to prevent heap growth were the main ones.

There's ideas in HOT from lots of people and places, SITC was just one
of those. There wasn't a dust-off between different designs, just an
attempt to take the best designs from wherever/whoever they came from,
so HOT isn't one person's design.

If anybody wants to change the name, we can, to whatever you like.

Maybe just make HOT an extended SITC which can span pages.

There is a strong willingness to get the design optimal, which will
require change.

http://www.postgresql.org/about/donate

That's exactly what HOT is all about....

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#15 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Tom Lane (#12)
Re: Frequent Update Project: Design Overview of HOT Updates

On 11/10/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:

Yes. The last bit in the t_infomask is used up to mark the presence of the
overflow tuple header. But I believe there are a few more bits that can be
reused. There are three bits available in the t_ctid field as well (since
ip_posid needs a maximum of 13 bits).

No, you cannot have those bits --- BLCKSZ is not limited to 8K, and even
if it were, we will not hold still for sticking flag bits into an
unrelated datatype.

BLCKSZ is not limited to 8K, but it is limited to 32K because of lp_off
and lp_len, which are 15 bits in size.

OffsetNumber is limited to (BLCKSZ / sizeof(ItemIdData)). Since
sizeof(ItemIdData) is 4 bytes, MaxOffsetNumber is 8192. So we need only 13
bits to represent the MaxOffsetNumber.

Regards,
Pavan

#16 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#12)
Re: Frequent Update Project: Design Overview of HOT Updates

Tom Lane wrote:

"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:

Yes. The last bit in the t_infomask is used up to mark the presence of the
overflow tuple header. But I believe there are a few more bits that can be
reused. There are three bits available in the t_ctid field as well (since
ip_posid needs a maximum of 13 bits).

No, you cannot have those bits --- BLCKSZ is not limited to 8K, and even
if it were, we will not hold still for sticking flag bits into an
unrelated datatype.

You can probably fix this by inventing multiple context-dependent
interpretations of t_infomask bits, but that strikes me as a mighty
ugly and fragile way to proceed.

Yeah, that seems ugly.

I think the best place to grab more bits is the t_natts field. The max
value for that is MaxTupleAttributeNumber == 1664, which fits in 11
bits. That leaves 5 bits for other use. We'll have to replace all direct
access to that field with an accessor macro, but according to grep there
aren't that many files that need to be changed.

(Actually, the assumption that you can throw an additional back-pointer
into overflow tuple headers is the worst feature of this proposal in
that regard --- it's really not that easy to support multiple header
formats.)

Well, we already have a variable length null bitmap in the header. It
seems quite straightforward to me to add the new field before the null
bitmap. It certainly requires some changes, in particular to places that
access the null bitmap, but it's not an insurmountable effort. Or am I
missing some less obvious consequences?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#17 Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#6)
Re: Frequent Update Project: Design Overview of HOT Updates

As more UPDATEs take place these tuple chains would grow, making
locating the latest tuple take progressively longer.

More generally, do we need an overflow table at all, rather
than having these overflow tuples living in the same file as
the root tuples? As long as there's a bit available to mark
a tuple as being this special not-separately-indexed type,
you don't need a special location to know what it is.

Yes, I have that same impression.

1. It doubles the IO (original page + hot page) if the new row would
have fit into the original page.

2. Locking should be easier if only the original heap page is involved.

3. It makes the overflow pages really hot, because all concurrent
updates compete for the few overflow pages.

4. Although at first it might seem so, I see no advantage for vacuum
with overflow.

5. The size reduction of the heap is imho moot, because you trade it for
a growing overflow (the size reduction only comes from reusing dead
tuples and not adding index tuples --> SITC).

Could you try to explain the reasoning behind separate overflow storage?
What has been stated so far was not really conclusive to me in this
regard; e.g. a different header seems no easier in overflow than in heap.

Andreas

#18 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#16)
Re: Frequent Update Project: Design Overview of HOT Updates

On 11/10/06, Heikki Linnakangas <heikki@enterprisedb.com> wrote:

Tom Lane wrote:

(Actually, the assumption that you can throw an additional back-pointer
into overflow tuple headers is the worst feature of this proposal in
that regard --- it's really not that easy to support multiple header
formats.)

Well, we already have a variable length null bitmap in the header. It
seems quite straightforward to me to add the new field before the null
bitmap. It certainly requires some changes, in particular to places that
access the null bitmap, but it's not an insurmountable effort. Or am I
missing some less obvious consequences?

We have added the overflow header (which right now contains a single
entry, i.e. the back pointer) on very similar lines to the optional Oid
field in the tuple header. A flag (the last free bit in t_infomask) is
used to check if there is an additional overflow header, and if so
t_hoff is adjusted appropriately.

So in the current prototype, the overflow header is after the null bitmap
and before the Oid, if it exists.

Regards,
Pavan

#19 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#6)
Re: Frequent Update Project: Design Overview of HOT Updates

On Thu, 2006-11-09 at 18:28 -0500, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

As more UPDATEs take place these tuple chains would grow, making
locating the latest tuple take progressively longer.

This is the part that bothers me --- particularly the random-access
nature of the search. I wonder whether you couldn't do something
involving an initial table fill-factor of less than 50%, and having
the first updated version living on the same heap page as its parent.
Only when the active chain length was more than one (which you
hypothesize is rare) would it actually be necessary to do a random
access into the overflow table.

That's appropriate sometimes, not others, but I'll investigate this
further so that it's possible to take advantage of non-zero fillfactors
when they exist.

There's a number of distinct use-cases here:

If you have a very small, heavily updated table it makes a lot of sense
to use lower fillfactors as well.

If you have a larger table, using fillfactor 50% immediately doubles the
size of the table. If the updates are uneven, as they mostly are because
of the Benford distribution/Pareto principle, then it has been found
that leaving space on the block doesn't help the heavily updated
portions of a table, whereas it hinders the lightly updated portions.

TPC-C and TPC-B both have uniformly distributed UPDATEs, so it's easy to
use the fillfactor to great advantage there.

More generally, do we need an overflow table at all, rather than having
these overflow tuples living in the same file as the root tuples? As
long as there's a bit available to mark a tuple as being this special
not-separately-indexed type, you don't need a special location to know
what it is. This might break down in the presence of seqscans though.

HOT currently attempts to place a subsequent UPDATE on the same page of
the overflow relation, but this doesn't happen (yet) for placing
multiple versions on the same page. IMHO it could, but I'll think about it.

This allows the length of a typical tuple chain to be extremely short in
practice. For a single connection issuing a stream of UPDATEs the chain
length will be no more than 1 at any time.

Only if there are no other transactions being held open, which makes
this claim a lot weaker.

True, but Nikhil has run tests that clearly show HOT outperforming
current situation in the case of long running transactions. The need to
optimise HeapTupleSatisfiesVacuum() and avoid long chains does still
remain a difficulty for both HOT and the current situation.

HOT can only work in cases where a tuple does not modify one of the
columns defined in an index on the table, and when we do not alter the
row length of the tuple.

Seems like "altering the row length" isn't the issue, it's just "is
there room on the page for the new version". Again, a generous
fillfactor would give you more flexibility.

The copy-back operation can only work if the tuple fits in the same
space as the root tuple. If it doesn't you end up with a tuple
permanently in the overflow relation. That might not worry us, I guess.

Also, my understanding was that an overwrite operation could not vary
the length of a tuple (at least according to code comments).

[We'll be able to do that more efficiently when
we have plan invalidation]

Uh, what's that got to do with it?

Currently the HOT code dynamically tests to see if the index columns
have been touched. If we had plan invalidation that would be able to be
assessed more easily at planning time, in cases where there is no BEFORE
trigger.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#20 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#19)
Re: Frequent Update Project: Design Overview of HOT Updates

Simon Riggs wrote:

On Thu, 2006-11-09 at 18:28 -0500, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

HOT can only work in cases where a tuple does not modify one of the
columns defined in an index on the table, and when we do not alter the
row length of the tuple.

Seems like "altering the row length" isn't the issue, it's just "is
there room on the page for the new version". Again, a generous
fillfactor would give you more flexibility.

The copy-back operation can only work if the tuple fits in the same
space as the root tuple. If it doesn't you end up with a tuple
permanently in the overflow relation. That might not worry us, I guess.

Also, my understanding was that an overwrite operation could not vary
the length of a tuple (at least according to code comments).

You can't if someone else has the page pinned.

[We'll be able to do that more efficiently when
we have plan invalidation]

Uh, what's that got to do with it?

Currently the HOT code dynamically tests to see if the index columns
have been touched. If we had plan invalidation that would be able to be
assessed more easily at planning time, in cases where there is no BEFORE
trigger.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
