An idle thought

Started by Greg Starkalmost 16 years ago25 messages
#1Greg Stark
stark@mit.edu

A few days ago there was a thread on one of our lists where someone
was suprised it took as much i/o to delete data as it took to insert
it in the first place. At first this does seem surprising but the fact
that Postgres stores its transaction information inline with the data
and does all i/o in blocks makes this inevitable.

http://archives.postgresql.org/pgsql-performance/2010-03/msg00141.php

However then I started thinking about this case and wondered if it
wouldn't be possible to optimize. One of the suggested optimizations
was to look at using TRUNCATE. But I wonder why it's necessary to use
a dedicated command. Shouldn't it be possible for the system to notice
this situation and do effectively the same thing itself?

I'm picturing storing a bit in the visibility map indicating that *no*
records are visible in a given page. If that bit is set then the page
is virtually empty even if it contains data. If anyone tries to load
the page they should just initialize a new page instead which will
overwrite it. That way a big batch delete just has to set a bunch of
flags in the visibility map to do the bulk delete.

There are a couple problems with the way I've described this idea
here. Firstly, if the deleting transaction hasn't committed yet or
later aborts then of course the records need to still be visible. So
either the visibility map would need an xmax for the page as a whole
or it would need to only be set after the page has actually been
vacuumed. Secondly there's the whole retail vacuum problem -- any
index entries referring to this page would be left dangling unless
there's some kind of retail vacuum or perhaps a page version number.

I'm not 100% sure this idea is workable but if it is it might make
batch deletes a lot less painful. Several orders of magnitude less i/o
whenever a single transaction deletes many rows.

--
greg

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#1)
Re: An idle thought

Greg Stark <stark@mit.edu> writes:

However then I started thinking about this case and wondered if it
wouldn't be possible to optimize. One of the suggested optimizations
was to look at using TRUNCATE. But I wonder why it's necessary to use
a dedicated command. Shouldn't it be possible for the system to notice
this situation and do effectively the same thing itself?

Not unless you'd like DELETE to always acquire exclusive lock...

There are a couple problems with the way I've described this idea
here.

Precisely because of the lack of lock.

regards, tom lane

#3Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Stark (#1)
Re: An idle thought

On Tue, 2010-03-16 at 15:29 +0000, Greg Stark wrote:

big batch delete

Is one of the reasons for partitioning, allowing the use of truncate.

--
Simon Riggs www.2ndQuadrant.com

#4Jeff Davis
pgsql@j-davis.com
In reply to: Greg Stark (#1)
Re: An idle thought

On Tue, 2010-03-16 at 15:29 +0000, Greg Stark wrote:

I'm picturing storing a bit in the visibility map indicating that *no*
records are visible in a given page.

I've been thinking for a while that we could store the visibility
information in a structure separate from the heap -- sort of like the
visibility map, but per-tuple and authoritative rather than a per-page
hint.

There are all kinds of challenges there, but it might be worth thinking
about. Visibility information is highly compressible, and requires
constant maintenance (updates, deletes, freezing, etc.). It also might
make it possible to move to 64-bit xids, if we wanted to.

Regards,
Jeff Davis

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#4)
Re: An idle thought

Jeff Davis <pgsql@j-davis.com> writes:

There are all kinds of challenges there, but it might be worth thinking
about. Visibility information is highly compressible, and requires
constant maintenance (updates, deletes, freezing, etc.). It also might
make it possible to move to 64-bit xids, if we wanted to.

If you want it to be cheaply updatable (or even cheaply readable),
compression is not what you're going to do.

regards, tom lane

#6Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#5)
Re: An idle thought

On Wed, 2010-03-17 at 17:20 -0400, Tom Lane wrote:

Jeff Davis <pgsql@j-davis.com> writes:

There are all kinds of challenges there, but it might be worth thinking
about. Visibility information is highly compressible, and requires
constant maintenance (updates, deletes, freezing, etc.). It also might
make it possible to move to 64-bit xids, if we wanted to.

If you want it to be cheaply updatable (or even cheaply readable),
compression is not what you're going to do.

I didn't mean that we'd want to compress it to the absolute minimum
size. I had envisioned that it would be a simple scheme designed only to
eliminate long runs of identical visibility information (perhaps only
the frozen and always visible regions would be compressed).

The extra level of indirection would be slower, but if we freeze tuples
more aggressively (which would be much cheaper if we didn't have to go
to the heap), there might be a small number of tuples with interesting
visibility information at any particular time.

Regards,
Jeff Davis

#7Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#4)
Re: An idle thought

On Wed, 2010-03-17 at 14:09 -0700, Jeff Davis wrote:

I've been thinking for a while that we could store the visibility
information in a structure separate from the heap -- sort of like the
visibility map, but per-tuple and authoritative rather than a per-page
hint.

A lot of people have been thinking that for a while, but I've yet to see
a fully costed assessment of whether its worth doing. It might be.

It seems easier to focus on outcomes rather than specifics, then order
the possible solutions to those outcomes in cost-benefit sequence.

--
Simon Riggs www.2ndQuadrant.com

#8Gokulakannan Somasundaram
gokul007@gmail.com
In reply to: Tom Lane (#5)
Re: An idle thought

On Thu, Mar 18, 2010 at 2:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Jeff Davis <pgsql@j-davis.com> writes:

There are all kinds of challenges there, but it might be worth thinking
about. Visibility information is highly compressible, and requires
constant maintenance (updates, deletes, freezing, etc.). It also might
make it possible to move to 64-bit xids, if we wanted to.

If you want it to be cheaply updatable (or even cheaply readable),
compression is not what you're going to do.

regards, tom lane

+1..

#9Gokulakannan Somasundaram
gokul007@gmail.com
In reply to: Greg Stark (#1)
Re: An idle thought

Secondly there's the whole retail vacuum problem -- any
index entries referring to this page would be left dangling unless
there's some kind of retail vacuum or perhaps a page version number.

The issue, we can divide into two
a)volatile functions
b)broken datatypes

For a) I think volatile function issue can be resolved by using hidden
columns in the heap itself. This will create a duplication of data, but
since the index will get created on it, it will always point to the right
heap tuple

For b) We are already suffering with this issue in any index lookups, index
based updates/deletes, unique constraints, referential integrity maintenance
etc. So hopefully one day we can extend this list to include more :))

Gokul.

In reply to: Jeff Davis (#6)
Re: An idle thought

I didn't mean that we'd want to compress it to the absolute minimum
size. I had envisioned that it would be a simple scheme designed only to
eliminate long runs of identical visibility information (perhaps only
the frozen and always visible regions would be compressed).

The extra level of indirection would be slower, but if we freeze tuples
more aggressively (which would be much cheaper if we didn't have to go
to the heap), there might be a small number of tuples with interesting
visibility information at any particular time.

This should be achievable with current proposal of Heikki, but I think it

is useful only for tables which won't have more concurrent operations and on
databases without any long running queries. So if we have an option to
create a visibiliy map ( on the lines of materialized view ), whenever we
want for a table, it would help a good number of use cases, i suppose.

Thanks,
Gokul.

#11Chris Browne
cbbrowne@acm.org
In reply to: Greg Stark (#1)
Re: An idle thought

simon@2ndQuadrant.com (Simon Riggs) writes:

On Tue, 2010-03-16 at 15:29 +0000, Greg Stark wrote:

big batch delete

Is one of the reasons for partitioning, allowing the use of truncate.

Sure, but it would be even nicer if DELETE could be thus made cheaper
without needing to interfere with the schema.

The concurrency issue might be resolved (*might!*) by the following
complication...

- A delete request is looking at a page, and concludes, "oh, all the
tuples here are now marked dead!".

- It flags the page as *possibly* dead. Almost what Greg suggests for
the visibility map, but this is just marking it as "proposed dead."

- It throws the page number, along with xid, into a side map.

When something wants to do something with the page (e.g. - vacuum), it
sees that it's "possibly dead," and looks at the "side map" for the list
of xids that wanted to mark the page dead.

for each xid:
if xid is still active
do nothing with it
else
remove xid entry from the map

if all xids were failed
remove flag from page
if any xid committed
empty the page; the tuples are all dead

I'm less confident about that last "clause" - I *think* that if *any*
page-clearing XID is found, that means the page is well and truly clear,
doesn't it?

The extra "map" mayn't be a nice thing.

It's food for thought, anyways.
--
let name="cbbrowne" and tld="linuxfinances.info" in String.concat "@" [name;tld];;
The real problem with the the year 2000 is that there are too many
zero bits and that adversely affects the global bit density.
-- Boyd Roberts <boyd@france3.fr>

#12Jeff Davis
pgsql@j-davis.com
In reply to: Gokulakannan Somasundaram (#8)
Re: An idle thought

On Thu, 2010-03-18 at 14:29 +0530, Gokulakannan Somasundaram wrote:

If you want it to be cheaply updatable (or even cheaply
readable),
compression is not what you're going to do.

regards, tom lane

+1..

The visibility map itself is already an example of compression. If
visibility information were randomly distributed among tuples, the
visibility map would be nearly useless.

Regards,
Jeff Davis

In reply to: Jeff Davis (#12)
Re: An idle thought

The visibility map itself is already an example of compression. If
visibility information were randomly distributed among tuples, the
visibility map would be nearly useless.

I believe it is very difficult to make visibility map update friendly

without compromising durability. But such a functionality is very much
wanted in PG still.

#14Jeff Davis
pgsql@j-davis.com
In reply to: Gokulakannan Somasundaram (#13)
Re: An idle thought

On Fri, 2010-03-19 at 01:59 +0530, Gokulakannan Somasundaram wrote:

The visibility map itself is already an example of
compression. If
visibility information were randomly distributed among tuples,
the
visibility map would be nearly useless.

I believe it is very difficult to make visibility map update friendly
without compromising durability. But such a functionality is very
much wanted in PG still.

Surely the VM is already update-friendly. If you update a tuple in a
page with the visibility bit set, the bit must be unset or you will get
wrong results.

Regards,
Jeff Davis

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#14)
Re: An idle thought

Jeff Davis <pgsql@j-davis.com> writes:

On Fri, 2010-03-19 at 01:59 +0530, Gokulakannan Somasundaram wrote:

I believe it is very difficult to make visibility map update friendly
without compromising durability. But such a functionality is very
much wanted in PG still.

Surely the VM is already update-friendly. If you update a tuple in a
page with the visibility bit set, the bit must be unset or you will get
wrong results.

The VM is (a) not compressed and (b) not correctness-critical.
Wrong bit values don't do any serious damage.

regards, tom lane

#16Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#15)
Re: An idle thought

On Thu, 2010-03-18 at 16:50 -0400, Tom Lane wrote:

The VM is (a) not compressed and (b) not correctness-critical.
Wrong bit values don't do any serious damage.

The VM cause wrong results if a bit is set that's not supposed to be --
right? Am I missing something? How does a seq scan skip visibility
checks and still produce right results, if it doesn't rely on the bit?

The visibility map would obviously not be very useful if visibility
information was randomly distributed among tuples. Whether that
qualifies as "compression" or not was not my primary point. The point is
that it may be possible to use some structure that is significantly
smaller than holding xmin/xmax for every tuple in the heap, and at the
same time may be acceptably fast to update.

Regards,
Jeff Davis

#17Alex Hunsaker
badalex@gmail.com
In reply to: Jeff Davis (#16)
Re: An idle thought

On Thu, Mar 18, 2010 at 15:07, Jeff Davis <pgsql@j-davis.com> wrote:

On Thu, 2010-03-18 at 16:50 -0400, Tom Lane wrote:

The VM is (a) not compressed and (b) not correctness-critical.
Wrong bit values don't do any serious damage.

The VM cause wrong results if a bit is set that's not supposed to be --
right? Am I missing something? How does a seq scan skip visibility
checks and still produce right results, if it doesn't rely on the bit?

Isn't it only really used for VACUUM at this point?

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#16)
Re: An idle thought

Jeff Davis <pgsql@j-davis.com> writes:

On Thu, 2010-03-18 at 16:50 -0400, Tom Lane wrote:

The VM is (a) not compressed and (b) not correctness-critical.
Wrong bit values don't do any serious damage.

The VM cause wrong results if a bit is set that's not supposed to be --
right? Am I missing something? How does a seq scan skip visibility
checks and still produce right results, if it doesn't rely on the bit?

It doesn't. The only thing we currently rely on the VM for is deciding
whether a page needs vacuuming --- and even that we don't trust it for
when doing anti-wraparound vacuuming. The worst-case consequence of a
wrong bit is failure to free some dead tuples until the vacuum freeze
limit expires.

In order to do things like not visiting a page during scans, we'll have
to solve the reliability issues.

regards, tom lane

#19Greg Stark
stark@mit.edu
In reply to: Jeff Davis (#16)
Re: An idle thought

On Thu, Mar 18, 2010 at 9:07 PM, Jeff Davis <pgsql@j-davis.com> wrote:

The VM cause wrong results if a bit is set that's not supposed to be --
right? Am I missing something? How does a seq scan skip visibility
checks and still produce right results, if it doesn't rely on the bit?

There's also a PD_ALL_VISIBLE flag on the page header. We wal log when
we clear that bit so we can trust if it's set then all the tuples
really are visible. I forget whether we can trust it if it's *not* set
but there's not much point -- all tuples could become visible
spontaneously even the page is sitting on disk.
--
greg

#20Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#18)
Re: An idle thought

On Thu, 2010-03-18 at 17:17 -0400, Tom Lane wrote:

The VM cause wrong results if a bit is set that's not supposed to be --
right? Am I missing something? How does a seq scan skip visibility
checks and still produce right results, if it doesn't rely on the bit?

It doesn't. The only thing we currently rely on the VM for is deciding
whether a page needs vacuuming

Oh, my mistake. I misremembered the discussion and I thought the seq
scan optimization made it in.

In order to do things like not visiting a page during scans, we'll have
to solve the reliability issues.

Yeah, and also for the index-only scans.

Regards,
Jeff Davis

#21Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#20)
Re: An idle thought

On Thu, 2010-03-18 at 14:48 -0700, Jeff Davis wrote:

On Thu, 2010-03-18 at 17:17 -0400, Tom Lane wrote:

The VM cause wrong results if a bit is set that's not supposed to be --
right? Am I missing something? How does a seq scan skip visibility
checks and still produce right results, if it doesn't rely on the bit?

It doesn't. The only thing we currently rely on the VM for is deciding
whether a page needs vacuuming

Oh, my mistake. I misremembered the discussion and I thought the seq
scan optimization made it in.

Hmm...

From heapgetpage() in heapam.c, CVS HEAD:

/*
* If the all-visible flag indicates that all tuples on the page
are
* visible to everyone, we can skip the per-tuple visibility tests.
But

I tested in gdb, and it calls HeapTupleSatisfiesMVCC, until I VACUUM a
few times, and then it doesn't call it any more. So, apparently the seq
scan optimization _is_ there. And that means it is correctness-critical.

Regards,
Jeff Davis

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#21)
Re: An idle thought

Jeff Davis <pgsql@j-davis.com> writes:

I tested in gdb, and it calls HeapTupleSatisfiesMVCC, until I VACUUM a
few times, and then it doesn't call it any more. So, apparently the seq
scan optimization _is_ there. And that means it is correctness-critical.

The page header bit is critical. Not the VM.

regards, tom lane

In reply to: Jeff Davis (#14)
Re: An idle thought

Surely the VM is already update-friendly. If you update a tuple in a
page with the visibility bit set, the bit must be unset or you will get
wrong results.

I was referring in the context of index only scans to skip visibility
checks. I doubt, whether the visibility map feature to skip visibility
checks at the heap can be created without any extra cost to updates/inserts.
When a data is compressed then there is more contention for the same block
and hence would likely affect DMLs. I hope that's what Tom was also
referring to, but not in the visibility map context.

Gokul.

#24Bruce Momjian
bruce@momjian.us
In reply to: Jeff Davis (#4)
Re: An idle thought

Jeff Davis wrote:

On Tue, 2010-03-16 at 15:29 +0000, Greg Stark wrote:

I'm picturing storing a bit in the visibility map indicating that *no*
records are visible in a given page.

I've been thinking for a while that we could store the visibility
information in a structure separate from the heap -- sort of like the
visibility map, but per-tuple and authoritative rather than a per-page
hint.

There are all kinds of challenges there, but it might be worth thinking
about. Visibility information is highly compressible, and requires
constant maintenance (updates, deletes, freezing, etc.). It also might
make it possible to move to 64-bit xids, if we wanted to.

I don't think we want to move to 64-bit xids becasue we would still need
to do vacuum freeze to trim the clog. In fact we do vacuum freeze much
more frequently than required for 32-bit xids for this very reason.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do

#25Jeff Davis
pgsql@j-davis.com
In reply to: Bruce Momjian (#24)
Re: An idle thought

On Mon, 2010-03-22 at 16:48 -0400, Bruce Momjian wrote:

I don't think we want to move to 64-bit xids becasue we would still need
to do vacuum freeze to trim the clog. In fact we do vacuum freeze much
more frequently than required for 32-bit xids for this very reason.

Good point. I think there are a lot of issues like this, which would
make storing the visibility information separate from the heap a huge
undertaking. Still worth thinking about, though.

Regards,
Jeff Davis