UNDO and in-place update

Started by Robert Haas · over 9 years ago · 62 messages · pgsql-hackers
#1 Robert Haas
robertmhaas@gmail.com

PostgreSQL tables and indexes not infrequently become bloated, and
this bloat is difficult to reverse once it has occurred. When a row
is updated, a new row version is created and the space occupied by the
old version can eventually be recycled. Since there is always a
short delay and sometimes a very long delay (hours) between the time
when the new row version is created and the time the old row version
can be removed, tables grow to accommodate storing both the old and
new row versions. When the old versions are reclaimed, there is no
easy way to shrink the table, because the free space is spread through
the table's data blocks rather than concentrated at the end of the
file. Moreover, since the space used by old row versions is only
reclaimed lazily -- either by HOT pruning or by VACUUM -- the table
may be further extended even when space for the new row version could have
been made available by more timely cleanup. Indexes have similar
issues: even though in theory there is more latitude to consolidate
free space, we don't always do so, and lazy reclamation of row
versions may cause relation extension. One factor contributing to
index bloat is that any non-HOT update to a heap tuple requires an
update to every index even if the indexed columns are not updated.

How can we get ahead of this problem? One approach is to make it
easier to reorganize the table; for example, a variant of CLUSTER that
doesn't block concurrent DML would be useful. However, that's really
only a stopgap, because it only makes it easier to salvage a table
where internal free space has become badly fragmented. What we want to
do is prevent free space from becoming fragmented in the first place.
Most other relational database systems do this by placing old row
versions in a separate data structure from the table itself where they
can easily be removed when no longer needed. A row version becomes
"old" when it is updated or deleted; the "new" version enters the main
heap while the "old" version moves elsewhere.

To implement this in PostgreSQL, I believe we would need support for
UNDO. Most database systems have both REDO -- what PostgreSQL
typically calls the write-ahead log (WAL) -- and UNDO. REDO is
responsible for restoring physical consistency after a database crash,
while UNDO is responsible for reversing the effects of aborted
transactions. When a transaction performs an operation (DO), it also
writes it to the write-ahead log (REDO) and records the information
needed to reverse it (UNDO). If the transaction aborts, UNDO is used
to reverse all changes made by the transaction. Both the actions
performed (DO) and the UNDO created for those actions are protected by
REDO. Thus, UNDO must be written to disk at checkpoint time; UNDO
since the most recent checkpoint prior to a crash is regenerated by
WAL replay (REDO). So, after a crash, we first replay WAL (REDO) and
then reverse out any transactions implicitly aborted by the crash
(UNDO). This basic DO-UNDO-REDO protocol has been well-understood for
decades. It would likely be worthwhile to implement DO-UNDO-REDO in
PostgreSQL independently of any design for reducing bloat, because it
would provide a systematic framework for handling cleanup actions that
must be performed at the end of recovery. For example, if a
transaction creates a table and, while that transaction is still in
progress, there is an operating system or PostgreSQL crash, the
storage used by the table is permanently leaked. This could be fixed
by having the operation that creates the table also emit UNDO which
would unlink the backing storage in the event of an abort.
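
The recovery sequence above can be sketched as a toy model -- this is
illustrative Python, not PostgreSQL code, and every name in it is
hypothetical:

```python
# Toy model of DO-UNDO-REDO crash recovery: replay WAL (REDO), which
# also regenerates the UNDO, then reverse any transaction that never
# committed. A WAL record is (xid, key, old_value, new_value), with
# old_value None for an INSERT.

def recover(wal, committed_xids):
    state = {}          # key -> value: the "database" after recovery
    undo = {}           # xid -> undo records, rebuilt during REDO
    for xid, key, old, new in wal:
        state[key] = new                     # Phase 1: REDO
        undo.setdefault(xid, []).append((key, old))
    for xid, records in undo.items():        # Phase 2: UNDO aborted xacts
        if xid not in committed_xids:
            for key, old in reversed(records):
                if old is None:
                    del state[key]           # undo an INSERT
                else:
                    state[key] = old         # undo an UPDATE/DELETE
    return state
```

Running it with one committed and one implicitly aborted transaction
shows the aborted transaction's effects fully reversed after recovery.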

We could have a single shared undo log to which all transactions
write, but the insertion point would almost certainly become a point
of contention. Moreover, it's not very desirable to interleave UNDO
from different transactions, because it makes cleanup difficult.
Suppose one transaction aborts while another continues to run; if
their UNDO records are interleaved, it will be difficult to recover
any space if one set of UNDO records can be discarded but the other
must be kept. Instead, it seems best to have a separate UNDO log for
each transaction. Each of these UNDO logs individually will be
written sequentially and will consist of a series of variable-length
records. When a particular UNDO log is no longer needed, we can
remove the whole thing at once. Unless the UNDO log spilled to disk
because it was large or because of a checkpoint during the
transaction, this is just a matter of deallocating or recycling
whatever shared memory we're using to store it; otherwise, we'll need
to remove or recycle the file, or the portion of a file, that was used
to persist it on disk.
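
A minimal sketch of the per-transaction UNDO log structure described
above, assuming a simple length-prefixed record encoding (all names
are hypothetical):

```python
# Each transaction gets its own append-only UNDO log of variable-length
# records, addressed by byte offset, and discarded as a single unit.

class UndoLog:
    def __init__(self):
        self.buf = bytearray()

    def append(self, payload: bytes) -> int:
        """Append one variable-length record; return its byte offset."""
        offset = len(self.buf)
        self.buf += len(payload).to_bytes(4, "little") + payload
        return offset

    def read(self, offset: int) -> bytes:
        n = int.from_bytes(self.buf[offset:offset + 4], "little")
        return bytes(self.buf[offset + 4:offset + 4 + n])

class UndoManager:
    def __init__(self):
        self.logs = {}                  # xid -> UndoLog

    def log_for(self, xid) -> UndoLog:
        return self.logs.setdefault(xid, UndoLog())

    def discard(self, xid):
        """Drop the whole log at once -- no per-record cleanup."""
        self.logs.pop(xid, None)
```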

Once we have UNDO, we could create a new type of heap where we emit
UNDO for each INSERT, UPDATE, or DELETE. For an INSERT, we will
insert the new row and emit UNDO which will remove it. For a
DELETE, we will immediately remove the old row version but emit UNDO
which will put it back. For an UPDATE, we can proceed in two ways,
depending on the situation. First, we can handle an UPDATE much as if
it were a DELETE combined with an INSERT, removing the old version,
adding the new one, and emitting UNDO to reverse both changes.
Second, in some cases, we can instead perform an in-place update,
modifying the existing record in place and emitting UNDO to restore
the old version. It will be desirable to use the second strategy as
often as possible, because an in-place update does not bloat the heap.

Of course, there are a few problems to solve. Since the only way to find
an old row version is to look at UNDO, we will need to retain UNDO for
as long as it might contain row versions that some overlapping
transaction needs to see. This means that we must retain UNDO for (1)
all transactions which are in progress, (2) for aborted transactions
until such time as all of the UNDO actions have been performed, and
(3) for committed transactions until they are all-visible. Note that
(1) and (2) would be required no matter what; UNDO is fundamentally a
system for tracking actions to be performed during transaction abort.
(3) is only required because of our desire to keep old row versions
out of the heap. We could limit the length of time for which UNDO in
category (3) is kept in order to implement "snapshot too old" for this
type of heap. (Or we could keep it for an extra-long time to
implement time travel.)
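
The three retention rules might be expressed as a predicate like the
following sketch; using an oldest-active-XID horizon as the
all-visible test is a simplifying assumption:

```python
# When may a transaction's UNDO be discarded? Mirrors the three
# retention categories above: (1) in-progress, (2) aborted but not yet
# undone, (3) committed but not yet all-visible.

def undo_discardable(status, undo_fully_applied, xid, oldest_active_xid):
    if status == "in_progress":
        return False                    # (1) still running: keep
    if status == "aborted":
        return undo_fully_applied       # (2) keep until fully undone
    # (3) committed: keep until all-visible; here approximated as
    # "older than every still-active transaction".
    return xid < oldest_active_xid
```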

Our current heap format requires each tuple to carry a very wide
header: 24 bytes, or wider if the null bitmap is large. 18 of
those bytes plus assorted flag bits are only there for MVCC purposes.
For static data, all of that space is wasted. Moreover, we can easily
end up dirtying the same page multiple times to set hint bits and,
later, to freeze XMIN and XMAX, resulting in repeated page writes.
With UNDO, we can do better. If all UNDO for a given page has been
removed, then all aborted transactions have been undone and all
committed transactions are all-visible; thus, every tuple on that page
is all-visible. Therefore, the page itself does not need to contain
any per-tuple visibility information; it only needs to be possible to
find any UNDO pertaining to that page. Since the UNDO will be removed
at exactly the same time that the visibility information is no longer
needed, we can store any required visibility information in the
UNDO itself.

Just as a particular WAL record is referenced using a byte position, a
particular UNDO record can be referenced by uniquely identifying the
transaction and the byte position within that transaction's UNDO log.
To uniquely identify a particular transaction, it seems we will need 8
bytes: the 4-byte epoch and the 4-byte XID. To uniquely identify a
position within that transaction's UNDO log, we will perhaps need
another 8 bytes. Thus, in total, an UNDO pointer will be 16 bytes.
We could perhaps reduce the size to 12 bytes by stipulating that a
transaction can only write 4GB of UNDO using a particular XID; if it
needed to generate more UNDO than that, it would need to allocate 1
additional XID for every 4GB of UNDO generated. Pages in this new
type of heap will have space to store an UNDO pointer (epoch + XID +
byte offset) in the page header. UNDO pointers where the XID is 0, 1,
or 2 cannot refer to a real UNDO log, because those XIDs are reserved
by PostgreSQL. Accordingly, we can reserve the all-zeroes UNDO pointer
as an invalid UNDO pointer. Also, let's reserve all UNDO pointers
where the XID is 1 as temporary page data (TPD) pointers. When a page
has been recently modified by only one transaction, the page's UNDO
pointer points to the most recent UNDO record generated by that
transaction for that page. There's no need to reset the pointer when
the UNDO log is removed; instead, we interpret a pointer to an UNDO
log which no longer exists as a sure sign that the page is all-frozen.
When a page has been recently modified by more than one transaction,
the page's UNDO pointer is a TPD pointer.
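
Assuming the 16-byte layout above (4-byte epoch, 4-byte XID, 8-byte
offset) and big-endian packing chosen purely for illustration, the
reserved-pointer rules could look like:

```python
# Sketch of the 16-byte UNDO pointer and its reserved values: the
# all-zeroes pointer is invalid, XID 1 marks a TPD pointer, and
# anything else points into a real UNDO log.

import struct

def make_undo_ptr(epoch: int, xid: int, offset: int) -> bytes:
    return struct.pack(">IIQ", epoch, xid, offset)

def classify_undo_ptr(ptr: bytes) -> str:
    epoch, xid, offset = struct.unpack(">IIQ", ptr)
    if epoch == 0 and xid == 0 and offset == 0:
        return "invalid"        # reserved all-zeroes pointer
    if xid == 1:
        return "tpd"            # temporary page data (TPD) pointer
    return "undo"               # points into some transaction's UNDO log
```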

A TPD pointer is a 64-bit reference to a TPD record. An UNDO pointer
will be either 12 or 16 bytes, so there's definitely room to store 64
additional bits there in addition to the XID of 1. TPD records are
stored in the TPD log, which is managed much like the write-ahead log:
records are accessed by byte offset, the log begins at byte position 0
and grows upward forever, and the older portions of the log can be
discarded once they are no longer needed. Unlike WAL, however, an
already-written TPD record can be modified in place. New TPD records
and changes to TPD records must be protected by WAL. A TPD will
contain one UNDO pointer per recent transaction (in progress, aborted
but not yet cleaned up, committed but not yet all-visible), indicating
the most recent change to the page by that transaction. Therefore,
it's always possible to find all of the recent transactions which have
modified a page and the most recent change by each one. Each UNDO
record for a particular page should contain a back-pointer to the
prior change to that same page by the same transaction, if any, so
once we've found the most recent change to the page by each recent
transaction we can use these back-pointers to find all of the other
ones as well. And that lets us find any old tuple versions we need.
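
The lookup described above -- TPD slots give the newest change per
recent transaction, and back-pointers give the rest -- can be sketched
as (record layout hypothetical):

```python
# Find every recent change to a page: start from each transaction's
# most-recent UNDO record (as listed in the TPD) and walk that
# transaction's per-page back-pointer chain.

def changes_to_page(undo_records, tpd_slots):
    """undo_records: record id -> (payload, back_ptr or None).
    tpd_slots: one most-recent record id per recent transaction."""
    found = []
    for head in tpd_slots:
        rec = head
        while rec is not None:          # follow per-xact back-pointers
            payload, back = undo_records[rec]
            found.append(payload)
            rec = back
    return found
```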

If a transaction modifies a page whose existing UNDO pointer is
invalid, or one whose existing UNDO pointer points into its own UNDO
log, it can just overwrite the existing UNDO pointer with a pointer to
the new change. Likewise, if the UNDO pointer is a TPD pointer and the
modifying transaction is already present in the TPD record, it can
just update the TPD record with the UNDO pointer for the new
modification. Alternatively, if the UNDO pointer is a TPD pointer and
one of the transactions in the TPD is no longer recent, it can grab
that slot in the existing TPD for its own UNDO pointer. However, if
the page points to some other transaction's UNDO log and that
transaction is still recent, or if the page points to a TPD all of
whose members are still recent, then the modifying transaction must
create a new and larger TPD for the page. The old TPD can then be
marked as garbage and reclaimed. Otherwise, a TPD entry can be
reclaimed when every UNDO pointer it contains points to an UNDO log
which has already been discarded. We can treat a page with a pointer
to a TPD entry which has been discarded just like a page with a
pointer to an UNDO log that no longer exists: the page is all-visible.
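
The decision rules in this paragraph can be summarized in a sketch;
the structures and action names here are invented for illustration:

```python
# Decide how a modifying transaction updates a page's UNDO pointer.
# page_ptr is None (invalid), ("undo", owner_xid), or ("tpd", slots)
# where slots maps xid -> undo pointer; is_recent is a predicate.

def choose_slot_action(page_ptr, my_xid, is_recent):
    if page_ptr is None:
        return "overwrite"                  # invalid pointer: just take it
    kind, payload = page_ptr
    if kind == "undo":
        if payload == my_xid or not is_recent(payload):
            return "overwrite"              # our own log, or a stale one
        return "create_tpd"                 # another recent transaction
    slots = payload                         # existing TPD record
    if my_xid in slots:
        return "update_tpd_slot"            # already have a slot
    for xid in slots:
        if not is_recent(xid):
            return "reuse_tpd_slot"         # evict a no-longer-recent entry
    return "enlarge_tpd"                    # all members recent: new, larger TPD
```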

In-place updates are fairly obviously safe and possible when no index
columns have been modified and the new tuple is no larger than the old
one. The indexes don't need to know anything about the change, and
the new tuple will definitely fit. In other cases, a bit more thought
is needed. If the new tuple is larger than the old one, it may still
be possible to do the update in place. The simplest case is when
there happens to be unused space following the existing tuple, and the
new and larger tuple will fit into that space. In that case, we do
not need to do anything special. Alternatively, it may be that there's
enough free space elsewhere on the page to accommodate the new tuple.
In that case, we could take a cleanup lock and reorganize the page.
However, if we do, we must be careful to leave enough space to permit
a successful UNDO. For example, imagine a completely full page where
we try to shrink one tuple in one transaction and enlarge a different
tuple in some other transaction. If we allow both updates to be in
place, we will have a serious problem if the shrinking transaction
aborts and the enlarging transaction commits.
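
A rough sketch of this feasibility check; treating the space that must
remain available for concurrent UNDO as a single `undo_reserved`
number is a simplification of the problem described above:

```python
# Can an UPDATE be done in place? Same-or-smaller tuples always can;
# larger ones need adjacent free space, or a page reorganization that
# still leaves enough room for any concurrent transaction's UNDO to
# succeed (e.g. re-enlarging a tuple some other transaction shrank).

def can_update_in_place(old_len, new_len, space_after_tuple,
                        page_free_space, undo_reserved):
    if new_len <= old_len:
        return True                     # always fits where it is
    growth = new_len - old_len
    if growth <= space_after_tuple:
        return True                     # fits into adjacent free space
    # Reorganize only if enough free space remains for UNDO afterward.
    return growth <= page_free_space - undo_reserved
```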

If any indexed columns are changed, an in-place update might confuse
the index AM. Index AMs like BRIN which do not store TIDs don't care,
but those that do will get confused, because there will now be entries
for multiple index values pointing to the same TID. That could
possibly be handled by forbidding index-only scans and requiring
rechecks in all cases, but that would be quite inefficient. Instead,
we would likely wish to adopt the solution used by other database
systems which work on principles similar to those described here:
whenever we update or delete a tuple, use the old tuple to re-find any
index entries that might no longer be valid, and apply a delete-mark
to them. When scanning an index, perform rechecks for all
delete-marked index items. Claudio's proposal to keep otherwise-equal
btree keys sorted by TID is probably essential for delete-marking to
perform well. Note that delete-marking isn't strictly required: we
could simply choose not to do in-place updates any time an indexed
column is modified. However, that leaves a lot of performance on the
table, because HOT updates are already fairly efficient; we want to
also reduce the bloat associated with updates that DO touch indexed
columns, and the write amplification that comes from touching every
index.
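
A toy model of delete-marking and recheck-on-scan; the structures here
are illustrative, not a real index AM:

```python
# On UPDATE/DELETE, the old tuple's key is used to re-find its index
# entries and delete-mark them; scans return unmarked entries directly
# and recheck marked ones against the heap.

class DeleteMarkIndex:
    def __init__(self):
        self.entries = {}               # (key, tid) -> delete-marked?

    def insert(self, key, tid):
        self.entries[(key, tid)] = False

    def delete_mark(self, key, tid):
        """Use the old tuple to re-find and mark its entry."""
        if (key, tid) in self.entries:
            self.entries[(key, tid)] = True

    def scan(self, key, heap_visible):
        """Return TIDs for key; marked entries need a heap recheck."""
        result = []
        for (k, tid), marked in self.entries.items():
            if k != key:
                continue
            if not marked or heap_visible(tid):
                result.append(tid)
        return result
```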

So, what are the pros and cons of this system?

- Performing updates in place where possible prevents bloat from being
created. It does not prevent all bloat: aside from any updates that
are not done in place, a transaction that inserts many rows and aborts
or deletes many rows and commits will still leave the table larger
than the optimum size. However, many workloads have more updates than
they do inserts or deletes, so this is probably a significant win.

- Old tuple versions are removed from the heap eagerly (by UNDO
cleanup at the end of a transaction) rather than lazily (by HOT
pruning and VACUUM). This increases the chance of being able to reuse
internal free space rather than extending the relation. Also, it's
probably better to incur the overhead of cleaning up after an abort
immediately rather than hours, days, weeks, or months later when
VACUUM kicks in; VACUUM behavior often surprises users. Better yet,
in the case where most transactions commit, we don't really
need to do any after-the-fact cleanup at all; the only thing that we
have to do in that case is throw away the UNDO logs for transactions
that have become all-visible, which is hopefully a cheap operation.

- Reading a page that has been recently modified gets significantly
more expensive; it is necessary to read the associated UNDO entries
and do a bunch of calculation that is significantly more complex than
what is required today. The problem is more severe when TPD entries
are required. There are probably some steps that can be taken to
mitigate this problem, such as setting aside a bit per tuple to mean
"this tuple has not been recently modified", or caching additional
details in the TPD when one is used, but it's unlikely to be possible
to eliminate it completely.

- Most things that could cause a page to be rewritten multiple times
are eliminated. Tuples no longer need to be frozen; instead, pages
are implicitly frozen by the removal of associated UNDO. Similarly,
hint bits are gone. The page-level all-visible bit needs some
thought; we'll still need a visibility map to support index-only
scans, but we probably don't want the page level bit any more, because
rewriting the whole heap just to set that one bit would be terrible.

- Delete-marking makes deletes more expensive, because they have to
delete-mark all of the index entries. Updates are more expensive when
all of the index columns change, because now each index needs to be
updated in two ways (new index entries and delete-marking of old ones)
rather than one. However, if some indexed columns are updated but
only a small minority of the indexes on the table include those
columns, updates are substantially cheaper, because new index entries
and delete-marking of old ones are required only for those indexes
where some column has been updated. On the whole, it seems likely
that we'd come out ahead here on many workloads, because updates that
change only one or a few indexed columns are probably much more common
than updates which change them all.

- Delete-marking assumes that we’re able to re-find index tuples. The
system today does not require such an assumption.

- Tuple headers become much smaller. The page header becomes bigger,
but there's only one page header per page, and there are many tuple
headers per page, so on average we should come out ahead.

- Transaction abort can be lengthy if much UNDO is required. This
problem can be mitigated by pushing the UNDO steps into the
background.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#1)
Re: UNDO and in-place update

Robert Haas <robertmhaas@gmail.com> writes:

[ Let's invent Oracle-style UNDO logs ]

I dunno. I remember being told years ago, by an ex-Oracle engineer,
that he thought our approach was better. I don't recall all the details
of the conversation but I think his key point was basically this:

- Reading a page that has been recently modified gets significantly
more expensive; it is necessary to read the associated UNDO entries
and do a bunch of calculation that is significantly more complex than
what is required today.

Oracle spends a lot of time on this, and it's really cache-inefficient
because the data is spread all over. This was what this guy felt in
circa 2001; I'd have to think that the cache unfriendliness problem is
much worse for modern hardware.

Which is not to say that it's not worth experimenting with. But what you
describe is going to be a huge amount of work even to get to the point
where we could try to measure the actual costs and benefits :-(

Heikki's been fooling with some ideas that I think have more promise.
I wish he'd get to the point of presenting them publicly rather than
just over beers at conferences.

regards, tom lane

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#2)
Re: UNDO and in-place update

On Tue, Nov 22, 2016 at 10:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

[ Let's invent Oracle-style UNDO logs ]

I dunno. I remember being told years ago, by an ex-Oracle engineer,
that he thought our approach was better. I don't recall all the details
of the conversation but I think his key point was basically this:

- Reading a page that has been recently modified gets significantly
more expensive; it is necessary to read the associated UNDO entries
and do a bunch of calculation that is significantly more complex than
what is required today.

Oracle spends a lot of time on this, and it's really cache-inefficient
because the data is spread all over. This was what this guy felt in
circa 2001; I'd have to think that the cache unfriendliness problem is
much worse for modern hardware.

Yes, that's the main thing I'm worried about with this design. Now,
bloat is pretty cache-inefficient too. If our table is 2x the size of
Oracle's table, they can spend more cycles per page and still do
fairly well. If that 2x increment pushes us out of RAM, they can
spend basically all the CPU time they want and still win handily. But
that isn't to say that this concern isn't justified.

Which is not to say that it's not worth experimenting with. But what you
describe is going to be a huge amount of work even to get to the point
where we could try to measure the actual costs and benefits :-(

Yeah.

Heikki's been fooling with some ideas that I think have more promise.
I wish he'd get to the point of presenting them publicly rather than
just over beers at conferences.

That would be good, too!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#3)
Re: UNDO and in-place update

On Tue, Nov 22, 2016 at 7:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Heikki's been fooling with some ideas that I think have more promise.
I wish he'd get to the point of presenting them publicly rather than
just over beers at conferences.

That would be good, too!

I was told that TED was kind of formally proposed in the "MultiXact
hindsight design" thread.

--
Peter Geoghegan

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#4)
Re: UNDO and in-place update

On Tue, Nov 22, 2016 at 10:41 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Tue, Nov 22, 2016 at 7:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Heikki's been fooling with some ideas that I think have more promise.
I wish he'd get to the point of presenting them publicly rather than
just over beers at conferences.

That would be good, too!

I was told that TED was kind of formally proposed in the "MultiXact
hindsight design" thread.

I read that, but:

1. Nothing's really come of that, AFAICS, and

2. It doesn't allow for in-place update, and

3. It doesn't eliminate the need for freezing.

Implementing TED would probably have been better than doing fkeylocks
the way we did, but now that we've mostly stabilized the multixact
work that was actually done, I have a hard time imagining that we
really want to revamp the whole thing without fixing any of the more
fundamental problems with our on-disk format. In my mind, the bloat
created by storing both the old and new versions of an updated tuple
in the heap, and the need to rewrite pages multiple times to set hint
bits, set all-visible, and ultimately freeze are the three-ton
gorillas in the corner. I am not arguing that what I've outlined here
is the only way to fix those problems or even the best way, just that
it isn't really worth the trouble of doing major surgery if we aren't
going to make a dent in those problems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Peter Geoghegan
pg@heroku.com
In reply to: Tom Lane (#2)
Re: UNDO and in-place update

On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

- Reading a page that has been recently modified gets significantly
more expensive; it is necessary to read the associated UNDO entries
and do a bunch of calculation that is significantly more complex than
what is required today.

Someone told me that there is something called an interested
transaction list stored in the page header, and from that I infer that
isn't *really* true. I think that unless you're interested in an
affected row, rather than just some row that happens to be on the same
page, you don't really have to worry about it.

Oracle spends a lot of time on this, and it's really cache-inefficient
because the data is spread all over. This was what this guy felt in
circa 2001; I'd have to think that the cache unfriendliness problem is
much worse for modern hardware.

I imagine that temporal locality helps a lot. Most snapshots will be
interested in the same row version.

--
Peter Geoghegan

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#6)
Re: UNDO and in-place update

Peter Geoghegan <pg@heroku.com> writes:

On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oracle spends a lot of time on this, and it's really cache-inefficient
because the data is spread all over. This was what this guy felt in
circa 2001; I'd have to think that the cache unfriendliness problem is
much worse for modern hardware.

I imagine that temporal locality helps a lot. Most snapshots will be
interested in the same row version.

Yeah, but the problem is that all those transactions have to scavenge
through a big chunk of UNDO log to find out what they need to know.

Ultimately, I doubt that update-in-place buys much that we don't already
have with HOT updates (which postdate this old conversation, btw).
If you want MVCC semantics, you need to hold both versions of the tuple
*somewhere*. Oracle's solution is different from ours, but I'm not
exactly convinced that it's better. It makes some aspects better and
others worse.

regards, tom lane

#8 Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Geoghegan (#6)
Re: UNDO and in-place update

On Wed, Nov 23, 2016 at 9:32 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

- Reading a page that has been recently modified gets significantly
more expensive; it is necessary to read the associated UNDO entries
and do a bunch of calculation that is significantly more complex than
what is required today.

Someone told me that there is something called an interested
transaction list stored in the page header, and from that I infer that
isn't *really* true. I think that unless you're interested in an
affected row, rather than just some row that happens to be on the same
page, you don't really have to worry about it.

Yeah, so basically if there is an effect of any transaction which is
not visible to the snapshot of the transaction reading the page, you need
to do something to read the old row/rows present on that page.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#9 Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#1)
Re: UNDO and in-place update

On Tue, Nov 22, 2016 at 7:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:

This basic DO-UNDO-REDO protocol has been well-understood for
decades.

FWIW, while this is basically true, the idea of repurposing UNDO to be
usable for MVCC is definitely an Oracleism. Mohan's ARIES paper says
nothing about MVCC.

--
Peter Geoghegan

#10 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#7)
Re: UNDO and in-place update

On Wed, Nov 23, 2016 at 9:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@heroku.com> writes:

On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oracle spends a lot of time on this, and it's really cache-inefficient
because the data is spread all over. This was what this guy felt in
circa 2001; I'd have to think that the cache unfriendliness problem is
much worse for modern hardware.

I imagine that temporal locality helps a lot. Most snapshots will be
interested in the same row version.

Yeah, but the problem is that all those transactions have to scavenge
through a big chunk of UNDO log to find out what they need to know.

I think that depends on how we design reading the old data. If every
transaction has to go through the UNDO log, that can impact
performance; however, if we design it so that a transaction can take
help from a previous transaction that has already gone through the
undo log, it will be much better. For that to happen, I think we need
some kind of page-level MVCC, and that should be possible in the
design proposed above, where the latest transactions' information is
present at the page level.

Ultimately, I doubt that update-in-place buys much that we don't already
have with HOT updates (which postdate this old conversation, btw).
If you want MVCC semantics, you need to hold both versions of the tuple
*somewhere*. Oracle's solution is different from ours, but I'm not
exactly convinced that it's better. It makes some aspects better and
others worse.

I think not only Oracle but some of the other database systems like
SQL Server (and IIRC MySQL) do a better job in terms of bloat
management (by keeping it separate from the main heap), which is why in
many cases they are preferred over PostgreSQL.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#11 Peter Geoghegan
pg@heroku.com
In reply to: Tom Lane (#7)
Re: UNDO and in-place update

On Tue, Nov 22, 2016 at 8:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Ultimately, I doubt that update-in-place buys much that we don't already
have with HOT updates (which postdate this old conversation, btw).
If you want MVCC semantics, you need to hold both versions of the tuple
*somewhere*. Oracle's solution is different from ours, but I'm not
exactly convinced that it's better. It makes some aspects better and
others worse.

I agree that HOT is roughly as effective, at least when you get mostly
HOT updates.

I think that there is an under-appreciated issue with index bloat and
VACUUM. Bloat in the sense of a B-Tree becoming unbalanced. I think
that short term bursts of non-HOT updates bloat the *keyspace* of
B-Trees, making them unbalanced. A short term burst (possibly
associated with one long running transaction that prevents pruning)
can cause long term damage to the key space.

The best thing by far about an alternative design like this is that it
performs *consistently*. I believe that we will lose on real-world
cases that play to the strengths of the current design. That doesn't
mean that it isn't worth it, but it does make it exceptionally
difficult.

--
Peter Geoghegan

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#11)
Re: UNDO and in-place update

Peter Geoghegan <pg@heroku.com> writes:

The best thing by far about an alternative design like this is that it
performs *consistently*.

Really? I think it just moves the issues somewhere else.

regards, tom lane

#13 Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Tom Lane (#2)
Re: UNDO and in-place update

On 23/11/16 16:31, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

[ Let's invent Oracle-style UNDO logs ]

I dunno. I remember being told years ago, by an ex-Oracle engineer,
that he thought our approach was better. I don't recall all the details
of the conversation but I think his key point was basically this:

- Reading a page that has been recently modified gets significantly
more expensive; it is necessary to read the associated UNDO entries
and do a bunch of calculation that is significantly more complex than
what is required today.

Also ROLLBACK becomes vastly more expensive than COMMIT (I can recall
many years ago when I used to be an Oracle DBA reading whole chapters of
novels waiting for failed batch jobs to roll back).

However I'd like to add that I agree this is worth looking at, as
ideally it would be great to be able to choose whether to have No-UNDO
or UNDO on a table by table basis...

regards

Mark


#14Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Peter Geoghegan (#11)
Re: UNDO and in-place update

On Wed, Nov 23, 2016 at 10:07 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Tue, Nov 22, 2016 at 8:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Ultimately, I doubt that update-in-place buys much that we don't already
have with HOT updates (which postdate this old conversation, btw).
If you want MVCC semantics, you need to hold both versions of the tuple
*somewhere*. Oracle's solution is different from ours, but I'm not
exactly convinced that it's better. It makes some aspects better and
others worse.

I agree that HOT is roughly as effective, at least when you get mostly
HOT updates.

I agree. We'd tried update-in-place before writing HOT and it turned out to
be complex and error-prone [1]. I am not suggesting that we can't do a
better job this time, but as Tom said, it's going to be a lot of work. If
WARM makes any progress, it will also help address some of the issues
around index bloat when HOT updates cannot be done.

One key issue that I see is that unrelated long-running transactions end up
holding up OldestXmin advances, and that leads to significant bloat in
update-heavy tables. If we could somehow address that situation, that may
still go a long way in containing bloat. Of course, there may still be
cases where bloat will be unavoidable, but it seems far less work to me
than a complete overhaul of the MVCC machinery.

Thanks,
Pavan

[1]: /messages/by-id/1163092396.3634.461.camel@silverbirch.site
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15Peter Geoghegan
pg@heroku.com
In reply to: Tom Lane (#12)
Re: UNDO and in-place update

On Tue, Nov 22, 2016 at 8:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@heroku.com> writes:

The best thing by far about an alternative design like this is that it
performs *consistently*.

Really? I think it just moves the issues somewhere else.

Definitely, yes.

* HOT is great and all, but HOT-safety is a pretty onerous thing to
ask users to worry about. They don't know anything about it. Maybe
WARM eventually helps with this -- I don't know about that.

* It's bad that the effectiveness of index-only scans is strongly
influenced by the visibility map. And, autovacuum doesn't care about
index-only scans (nor should it, I suppose).

* The high watermark size of a table is much higher for an
update-mostly workload. When bloat is localized to certain values in
indexes, it's a lot worse.

* Our behavior with many duplicates in secondary indexes is pretty bad
in general, I suspect. I think we do badly at amortizing inserts into
indexes from updates (we dirty more leaf pages than we really need
to). I think we could do better at making LP_DEAD style IndexTuple
recycling more effective.

* Most obviously, autovacuum can create significant load at
inconvenient times. Paying that cost in a piecemeal fashion has a
certain appeal.

For what it's worth, I am not planning to work on this.

--
Peter Geoghegan


#16Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Robert Haas (#1)
Re: UNDO and in-place update

Robert Haas wrote:

To implement this in PostgreSQL, I believe we would need support for
UNDO.

- Reading a page that has been recently modified gets significantly
more expensive; it is necessary to read the associated UNDO entries
and do a bunch of calculation that is significantly more complex than
what is required today.

As others have said, that is the main drawback.

Also, and related to this, ROLLBACK would become an expensive operation.

Yours,
Laurenz Albe


#17Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#7)
Re: UNDO and in-place update

On Tue, Nov 22, 2016 at 11:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@heroku.com> writes:

On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oracle spends a lot of time on this, and it's really cache-inefficient
because the data is spread all over. This was what this guy felt in
circa 2001; I'd have to think that the cache unfriendliness problem is
much worse for modern hardware.

I imagine that temporal locality helps a lot. Most snapshots will be
interested in the same row version.

Yeah, but the problem is that all those transactions have to scavenge
through a big chunk of UNDO log to find out what they need to know.

In my design, there's one UNDO log per transaction, so if a page has
been recently modified only by a single transaction whose effects are
visible to your snapshot, you can very quickly decide that you don't
need to look at UNDO at all. If it's been recently modified by
multiple transactions all of which are visible to your snapshot, you
can consult the TPD and then decide that none of those transactions
are interesting and therefore their UNDO doesn't need to be examined.

When there are transactions whose effects are not visible to your
snapshot, the cost is proportional to the number of times those
transactions have modified the page. Therefore, the "bad" case for
this design is when the same page gets updated a large number of times
by one or more concurrent transactions. If the transaction you can't
see has made 437 separate modifications to the page, you're going to
need to look at all 437 UNDO records to figure out what you can see.
That is likely to suck.
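
As a rough illustration of that cost model (a toy sketch of my own, not
actual PostgreSQL code; all structure names are invented), the reader-side
decision might look like:

```python
# Toy model: a reader consults UNDO only for transactions on the page that
# its snapshot cannot see, and the cost is proportional to the number of
# UNDO records those transactions left for this page.
from dataclasses import dataclass
from typing import Optional

@dataclass
class UndoRecord:
    xid: int
    prev_for_same_page: Optional["UndoRecord"] = None  # per-page chain link

@dataclass
class Page:
    recent_xids: list   # recent modifying XIDs, from page header or TPD
    undo_tail: dict     # xid -> newest UndoRecord for this page

def undo_records_to_examine(page, visible_xids):
    """Collect the UNDO records a snapshot must replay for this page."""
    needed = []
    for xid in page.recent_xids:
        if xid in visible_xids:
            continue                      # effects visible: skip its UNDO
        rec = page.undo_tail.get(xid)
        while rec is not None:            # one entry per page modification
            needed.append(rec)
            rec = rec.prev_for_same_page
    return needed

# A transaction invisible to the snapshot that modified the page 3 times:
r1 = UndoRecord(900)
r2 = UndoRecord(900, r1)
r3 = UndoRecord(900, r2)
page = Page(recent_xids=[850, 900], undo_tail={900: r3})
print(len(undo_records_to_examine(page, {850})))  # -> 3
```

If all the recent XIDs are visible, the loop never touches UNDO at all,
which is the fast path the design counts on.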

I think that one of the crucial questions for this design is how often
that will happen. For people with mostly static data, it should be
rare. In something like a pgbench workload, it boils down to how
often multiple transactions hit the same page before the previous
transactions that hit that page become all-visible. Of course, if you
have too many backends relative to the scale factor, pgbench is going
to become lock-bound anyway and it probably won't matter much, but
it's about two orders of magnitude easier to hit the same page than to
hit the same tuple, so it could be that the point where you start
incurring significant costs for visibility checking and page
reconstruction happens well before the tuple lock contention becomes
noticeable. It might be worth doing some mathematical modeling to try
to estimate how serious that effect is likely to be.
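
As a starting point for that modeling, a birthday-style approximation
gives a feel for the two-orders-of-magnitude gap (a rough sketch of my
own; the rows-per-page density below is an assumption, not a measured
figure):

```python
# With n_active concurrent updates landing uniformly on N targets, the
# birthday approximation gives P(collision) ~= 1 - exp(-n*(n-1)/(2*N)).
# Comparing N = rows vs. N = pages shows how much likelier a page
# collision is than a tuple collision.
import math

def p_collision(n_active, n_targets):
    return 1 - math.exp(-n_active * (n_active - 1) / (2 * n_targets))

rows = 10_000_000           # e.g. roughly pgbench scale factor 100
rows_per_page = 60          # assumed heap density, for illustration only
pages = rows // rows_per_page

print(p_collision(32, rows))   # tuple-level collisions: rare
print(p_collision(32, pages))  # page-level: ~60x likelier here
```

Under these assumptions, 32 concurrent updates almost never hit the same
tuple but hit the same page far more often, which is the regime where the
UNDO-walking cost would start to show.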

There are also some things that you can do to mitigate this, although
they add even more complexity. For example:

- If the same transaction touches the same page repeatedly, you could
make a rule that every 5th or every 10th or every 20th modification to
the same page within the same transaction emits a larger summary
record into the UNDO that is a consolidated record of all changes made
to the page by that transaction thus far. That penalizes writers, but
helps readers; when a reader reaches a summary record, it need look no
further.

- Peter alluded to the possibility - or so I thought - of caching
reconstructed page versions. That would make a lot of sense for
sequential scans or cases where the same page is being accessed
repeatedly.

- If you're doing an index scan and you only care about one tuple, the
page could contain one bit per tuple which is set if the tuple has
been modified by any transaction the page views as recent. If the bit
is clear for the one tuple you care about, you can ignore the UNDO
even if it is for a transaction not visible to your snapshot, because you
know it didn't touch that tuple.

- If you create a TPD entry for the page because there are multiple
recent transactions that have modified it, you can put additional
information into the TPD entry to speed up access. For example, you
could store an array indexed by itemid which indicates which of the
recent transactions have touched any given itemid. This is mostly
useful for index scans, which can then ignore all the transactions
that don't pertain to the one TID they care about.
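
The first mitigation, the periodic summary record, might look something
like this toy sketch (my own illustration, not a design artifact; all
names are invented):

```python
# Every SUMMARY_INTERVAL-th change a transaction makes to the same page
# emits a consolidated "summary" UNDO record covering all of its changes
# so far, so a reader walking the chain newest-to-oldest can stop at the
# first summary instead of visiting every individual record.
SUMMARY_INTERVAL = 5

def emit_undo(chain, all_changes, change):
    """Append an UNDO record for one page modification."""
    all_changes.append(change)
    if len(all_changes) % SUMMARY_INTERVAL == 0:
        # Consolidated record of all changes by this xact to this page
        chain.append({"summary": True, "changes": list(all_changes)})
    else:
        chain.append({"summary": False, "changes": [change]})

def reader_walk(chain):
    """Walk newest-to-oldest, stopping at the first summary record."""
    visited = []
    for rec in reversed(chain):
        visited.append(rec)
        if rec["summary"]:
            break                 # summary covers all older records
    return visited

chain, all_changes = [], []
for i in range(12):
    emit_undo(chain, all_changes, f"change-{i}")
print(len(reader_walk(chain)))    # -> 3 records visited instead of 12
```

The writer pays for the consolidated records, but a reader's worst case
drops from "every modification" to roughly one interval's worth.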

Which of those ideas are best? Are they enough, even if we do them
all? Will there be other down sides? I don't know. But I think it's
clear that bloat is an ongoing and serious concern for our users (cf.
snapshot too old), and that vacuuming is too (cf. freeze map) and I
believe we are reaching a point where we can't make much further
improvement in those areas without some kind of major architectural
change. WARM might qualify; this idea certainly does. I also believe
that we NEED to improve further. At least for EnterpriseDB, the
problems I'm trying to address through this design are big problems.
If this design isn't good, let's come up with something else.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#18Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Geoghegan (#15)
Re: UNDO and in-place update

On Wed, Nov 23, 2016 at 6:01 PM, Peter Geoghegan <pg@heroku.com> wrote:

* Our behavior with many duplicates in secondary indexes is pretty bad
in general, I suspect.

From the pie-in-the-sky department: I believe there are
snapshot-based systems that don't ever have more than one entry for a
logical row in a btree. Instead, they do UNDO-log based snapshot
reconstruction for btrees too, generalising what is being discussed
for heap tables in this thread. That means that whenever you descend
a btree, at each page you have to be prepared to go and look in the
UNDO log if necessary on a page-by-page basis. Alternatively, some
systems use a shadow table rather than an UNDO log for the old
entries, but still privilege queries that need the latest version by
keeping only those in the btree. I don't know the details but suspect
that both of those approaches might require CSN (= LSN?) based
snapshots, so that you can make page-level visibility determinations
while traversing the 'current' btree based on a page header. That
would reduce both kinds of btree bloat: the primary kind being
entries for dead tuples that still exist in the heap, and the
secondary kind being the resultant permanent btree fragmentation that
remains even after vacuum removes them. Presumably once you have all
that, you can allow index-only-scans without a visibility map. Also,
I suspect that such a "3D time travelling" btree would be a
requirement for index ordered tables in a snapshot-based RDBMS.
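
That page-level determination might be sketched like this (speculative,
assuming CSN-based snapshots and a per-page last-modification CSN; none
of this is an existing PostgreSQL mechanism, and all names are invented):

```python
# If each btree page header carries the CSN of its last modification, a
# scan using a CSN-based snapshot can trust the whole page when the page
# predates the snapshot, and fall back to UNDO reconstruction otherwise.

def page_visible_as_is(page_csn, snapshot_csn):
    """Whole-page check: every entry predates the snapshot."""
    return page_csn <= snapshot_csn

def scan_page(page, snapshot_csn, reconstruct_from_undo):
    if page_visible_as_is(page["csn"], snapshot_csn):
        return page["entries"]            # fast path: no UNDO lookup
    # Slow path: rebuild the page as of the snapshot from UNDO.
    return reconstruct_from_undo(page, snapshot_csn)

# Example: a page last touched at CSN 100, read by a snapshot at CSN 200
page = {"csn": 100, "entries": ["k1", "k2"]}
print(scan_page(page, 200, lambda p, s: []))  # page used directly
```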

--
Thomas Munro
http://www.enterprisedb.com


#19Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#1)
Re: UNDO and in-place update

IMHO, overall, there should be pros and cons to both the current approach and the new UNDO one (like Oracle's?), depending on the workload. Under an update-heavy workload, the UNDO method may be better. OTOH, under a mostly-INSERT workload (like a data warehouse?), the current method will be better because it writes no UNDO log.

Furthermore, it may be best to be able to switch the method for each table and/or tablespace. For example, in pgbench, the history table could use the current method,
and the other tables the UNDO method. Is it time to introduce a pluggable storage system?

Because PostgreSQL would be a latecomer to the UNDO approach, I think it will be better to study other DBMSs well (Oracle and MySQL?). That includes not only their manuals, but also whitepapers and books. In particular, I expect good books to give deep knowledge of performance tuning and troubleshooting, from which we will be able to learn the cons that Oracle's own materials don't state.

Regards
Takayuki Tsunakawa


#20Peter Geoghegan
pg@heroku.com
In reply to: Tsunakawa, Takayuki (#19)
Re: UNDO and in-place update

On Wed, Nov 23, 2016 at 11:32 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

IMHO, overall, there should be pros and cons of the current approach and the new UNDo one (like Oracle?), depending on the workload. Under update-heavy workload, the UNDO method may be better. OTOH, under the mostly-INSERT workload (like data warehouse?), the current method will be better because it writes no log for UNDO.

I believe that you are correct about that.

--
Peter Geoghegan


#21Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#20)
#22Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#9)
#23Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#18)
#24Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#22)
#25Thomas Kellerer
spam_eater@gmx.net
In reply to: Bruce Momjian (#22)
#26Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#19)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#22)
#28Pavel Stehule
pavel.stehule@gmail.com
In reply to: Robert Haas (#27)
#29Robert Haas
robertmhaas@gmail.com
In reply to: Pavel Stehule (#28)
#30Pavel Stehule
pavel.stehule@gmail.com
In reply to: Robert Haas (#29)
#31Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#23)
#32Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#31)
#33Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#26)
#34Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#27)
#35Amit Kapila
amit.kapila16@gmail.com
In reply to: Bruce Momjian (#33)
#36Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#35)
#37Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#26)
#38Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#37)
#39Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#36)
#40Bruce Momjian
bruce@momjian.us
In reply to: Amit Kapila (#35)
#41Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#39)
#42Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#41)
#43Alexander Korotkov
aekorotkov@gmail.com
In reply to: Amit Kapila (#42)
#44Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#42)
#45Alexander Korotkov
aekorotkov@gmail.com
In reply to: Robert Haas (#44)
#46Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#45)
#47Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#44)
#48Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#47)
#49Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#48)
#50Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#49)
#51Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#50)
#52Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#49)
#53Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#51)
#54Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#52)
#55Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#54)
#56Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#55)
#57Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#56)
#58Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#57)
#59Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#58)
#60Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#59)
#61Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#60)
#62Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#61)