Vacuum only with 20% old tuples

Started by Bruce Momjian over 25 years ago, 38 messages
#1Bruce Momjian
pgman@candle.pha.pa.us

I suggest that we change vacuum to only move/remove tuples if more than
20% of the tuples are expired.

When we do vacuum, we drop all indexes and recreate them.

This fixes the complaint about vacuum slowness when there are many
expired rows in the table. We know this is caused by excessive index
updates. It allows indexes to shrink (Jan pointed this out to me.) And
it fixes the TOAST problem with TOAST values in indexes.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#1)
Re: Vacuum only with 20% old tuples

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I suggest that we change vacuum to only move/remove tuples if more than
20% of the tuples are expired.

When we do vacuum, we drop all indexes and recreate them.

This fixes the complaint about vacuum slowness when there are many
expired rows in the table. We know this is caused by excessive index
updates. It allows indexes to shrink (Jan pointed this out to me.) And
it fixes the TOAST problem with TOAST values in indexes.

We can't "drop and recreate" without a solution to the relation
versioning issue (unless you are prepared to accept a nonfunctional
database after a failure partway through index rebuild on a system
table). I think we should do this, but it's not all that simple...

I do not see what your 20% idea has to do with this, though, nor
why it's a good idea. If I've told the thing to vacuum I think
it should vacuum. 20% of a big table could be a lot of megabytes,
and I don't want some arbitrary decision in the code about whether
I can reclaim that space or not.

regards, tom lane

#3Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#2)
Re: Vacuum only with 20% old tuples

We can't "drop and recreate" without a solution to the relation
versioning issue (unless you are prepared to accept a nonfunctional
database after a failure partway through index rebuild on a system
table). I think we should do this, but it's not all that simple...

I do not see what your 20% idea has to do with this, though, nor
why it's a good idea. If I've told the thing to vacuum I think
it should vacuum. 20% of a big table could be a lot of megabytes,
and I don't want some arbitrary decision in the code about whether
I can reclaim that space or not.

Well, I think we should do a sequential scan before starting vacuum to
find the number of expired rows.

Now that we are removing indexes, doing that just to remove a few tuples is a
major waste. The user cannot really know if the table is worth
vacuuming in normal use. They are just going to use the default. Now,
I think a FORCE option would be good, or the ability to change the 20%
default.

Remember, commercial db's don't even return unused space if you remove
all the rows in a table. At least Informix doesn't, and I am sure there
are others. I like vacuum, but let's not make it jump major hurdles for
small gain.
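
For concreteness, a rough C sketch of the pre-scan idea. Everything here
is illustrative (the real code would walk heap pages, not call these
made-up helpers), but it shows the shape of the check:

/* Hypothetical pre-scan: one sequential pass counts expired tuples,
 * and the expensive part (drop indexes, move tuples, recreate indexes)
 * is skipped unless they exceed a threshold. */
#include <stdbool.h>
#include <stdio.h>

#define EXPIRED_THRESHOLD 0.20          /* the proposed 20% default */

typedef struct { long live; long expired; } ScanStats;

/* stand-in for a sequential scan of the heap */
static ScanStats scan_count_expired(const char *rel)
{
    ScanStats s = { 900, 100 };         /* pretend: 10% expired */
    (void) rel;
    return s;
}

static void vacuum_full(const char *rel)
{
    printf("vacuuming %s: drop indexes, move tuples, recreate indexes\n", rel);
}

static void maybe_vacuum(const char *rel, bool force)
{
    ScanStats s = scan_count_expired(rel);
    long total = s.live + s.expired;

    if (force ||
        (total > 0 && (double) s.expired / total >= EXPIRED_THRESHOLD))
        vacuum_full(rel);
    else
        printf("skipping %s: only %ld of %ld tuples expired\n",
               rel, s.expired, total);
}

int main(void)
{
    maybe_vacuum("bigtable", false);    /* skipped at 10% expired */
    maybe_vacuum("bigtable", true);     /* a FORCE option overrides it */
    return 0;
}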

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#4Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#2)
RE: Vacuum only with 20% old tuples

-----Original Message-----
From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
Behalf Of Tom Lane

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I suggest that we change vacuum to only move/remove tuples if more than
20% of the tuples are expired.

When we do vacuum, we drop all indexes and recreate them.

This fixes the complaint about vacuum slowness when there are many
expired rows in the table. We know this is caused by excessive index
updates. It allows indexes to shrink (Jan pointed this out to me.) And
it fixes the TOAST problem with TOAST values in indexes.

We can't "drop and recreate" without a solution to the relation
versioning issue (unless you are prepared to accept a nonfunctional
database after a failure partway through index rebuild on a system
table). I think we should do this, but it's not all that simple...

Is this topic independent of WAL in the first place?

Regards.

Hiroshi Inoue

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#4)
Re: Vacuum only with 20% old tuples

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

We can't "drop and recreate" without a solution to the relation
versioning issue (unless you are prepared to accept a nonfunctional
database after a failure partway through index rebuild on a system
table). I think we should do this, but it's not all that simple...

Is this topic independent of WAL in the first place?

Sure, unless Vadim sees some clever way of using WAL to eliminate
the need for versioned relations. But as far as I've seen in the
discussions, versioned relations are independent of WAL.

Basically what I want here is to build the new index relation as
a new file (set of files, if large) and then atomically commit it
as the new version of the index.

If we only want to solve the problem of rebuilding indexes, it's
probably not necessary to have true versions, because nothing outside
of pg_index refers to an index. You could build a complete new index
(new OID, new pg_class and pg_attribute entries, the whole nine yards)
as a new set of files, and delete the old index, and your commit of
this transaction would atomically replace the index. (Vacuuming
pg_index's own indexes this way might be a tad tricky though...)
But that approach doesn't solve the problem of making a CLUSTER
operation that really works the way it should. So I'd rather see us
put the effort into doing relation versions.

regards, tom lane

#6Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#5)
RE: Vacuum only with 20% old tuples

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

We can't "drop and recreate" without a solution to the relation
versioning issue (unless you are prepared to accept a nonfunctional
database after a failure partway through index rebuild on a system
table). I think we should do this, but it's not all that simple...

Is this topic independent of WAL in the first place?

Sure, unless Vadim sees some clever way of using WAL to eliminate
the need for versioned relations. But as far as I've seen in the
discussions, versioned relations are independent of WAL.

Basically what I want here is to build the new index relation as
a new file (set of files, if large) and then atomically commit it
as the new version of the index.

Hmm, your plan seems to need WAL.
We must postpone building the indexes until the end of tuple moving
in vacuum. Once tuple moving has started, the consistency between
heap and indexes would be broken. Currently (without WAL) this
inconsistency could never be recovered in case of rollback.

If we only want to solve the problem of rebuilding indexes, it's
probably not necessary to have true versions, because nothing outside
of pg_index refers to an index. You could build a complete new index
(new OID, new pg_class and pg_attribute entries, the whole nine yards)
as a new set of files, and delete the old index, and your commit of
this transaction would atomically replace the index. (Vacuuming
pg_index's own indexes this way might be a tad tricky though...)

??? Don't pg_class and pg_attribute need tricky handling too?
It seems pg_class alone needs tricky handling when we use rel versioning.

Anyway, we couldn't rely on the indexes of the table currently being vacuumed.
I don't think it's easy to maintain the indexes of pg_class, pg_index, and
pg_attribute all together properly.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#6)
Re: Vacuum only with 20% old tuples

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Basically what I want here is to build the new index relation as
a new file (set of files, if large) and then atomically commit it
as the new version of the index.

Hmm, your plan seems to need WAL.
We must postpone building the indexes until the end of tuple moving
in vacuum. Once tuple moving has started, the consistency between
heap and indexes would be broken. Currently (without WAL) this
inconsistency could never be recovered in case of rollback.

Why? The same commit that makes the new index valid would make the
tuple movements valid. Actually, the way VACUUM currently works,
the tuple movements have been committed before we start freeing index
entries anyway. (One reason VACUUM is so inefficient with indexes
is that there is a peak index usage where there are index entries for
*both* old and new tuple positions. I don't feel a need to change
that, as long as the duplicate entries are in the old index that
we're hoping to get rid of.)

this transaction would atomically replace the index. (Vacuuming
pg_index's own indexes this way might be a tad tricky though...)

??? Don't pg_class and pg_attribute need tricky handling too?
It seems pg_class alone needs tricky handling when we use rel versioning.

Could be. I think working through how we handle system tables and
indexes is the key stumbling block we've got to get past to have
versioning. I don't know quite how to do it, yet.

regards, tom lane

#8Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#7)
RE: Vacuum only with 20% old tuples

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Basically what I want here is to build the new index relation as
a new file (set of files, if large) and then atomically commit it
as the new version of the index.

Hmm, your plan seems to need WAL.
We must postpone building the indexes until the end of tuple moving
in vacuum. Once tuple moving has started, the consistency between
heap and indexes would be broken. Currently (without WAL) this
inconsistency could never be recovered in case of rollback.

Why? The same commit that makes the new index valid would make the
tuple movements valid.

Oops, I realized I wasn't correct. Certainly it's not as dangerous as
I wrote. But there remains a possibility that index tuples would point to
cleaned heap blocks unless we delete the index tuples for those heap blocks.
Cleaned blocks would be reused by UPDATE operations.

Regards.

Hiroshi Inoue

#9Jan Wieck
JanWieck@t-online.de
In reply to: Tom Lane (#5)
Re: Vacuum only with 20% old tuples

Tom Lane wrote:

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

We can't "drop and recreate" without a solution to the relation
versioning issue (unless you are prepared to accept a nonfunctional
database after a failure partway through index rebuild on a system
table). I think we should do this, but it's not all that simple...

Is this topic independent of WAL in the first place?

Sure, unless Vadim sees some clever way of using WAL to eliminate
the need for versioned relations. But as far as I've seen in the
discussions, versioned relations are independent of WAL.

Basically what I want here is to build the new index relation as
a new file (set of files, if large) and then atomically commit it
as the new version of the index.

Which implicitly says we need to vacuum the TOAST relation
AFTER being completely done with the indices, in contrast
to what you said before. Otherwise, the old index (the
active one) would still refer to entries that don't exist any
more.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #

#10Andrew McMillan
Andrew@catalyst.net.nz
In reply to: Bruce Momjian (#1)
Re: Vacuum only with 20% old tuples

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I suggest that we change vacuum to only move/remove tuples if more than
20% of the tuples are expired.

When we do vacuum, we drop all indexes and recreate them.

This fixes the complaint about vacuum slowness when there are many
expired rows in the table. We know this is caused by excessive index
updates. It allows indexes to shrink (Jan pointed this out to me.) And
it fixes the TOAST problem with TOAST values in indexes.

We can't "drop and recreate" without a solution to the relation
versioning issue (unless you are prepared to accept a nonfunctional
database after a failure partway through index rebuild on a system
table). I think we should do this, but it's not all that simple...

I do not see what your 20% idea has to do with this, though, nor
why it's a good idea. If I've told the thing to vacuum I think
it should vacuum. 20% of a big table could be a lot of megabytes,
and I don't want some arbitrary decision in the code about whether
I can reclaim that space or not.

I can see some value in having a _configurable_ threshold percentage of
deletes before vacuum kicks in and attempts to shrink table/index
on-disk file sizes. This would let the end user decide, and 20% is
probably a reasonable default, but if it isn't, then changing a default
is easier to do down the track.

I can also see that it could be done with (perhaps) a modification to
VACUUM syntax, say:
VACUUM [VERBOSE] [SHRINK] ...

And I believe that the whole thing will go better if ANALYZE is taken
_out_ of vacuum, as was discussed on this list a month or two ago.

Cheers,
Andrew.
--
_____________________________________________________________________
Andrew McMillan, e-mail: Andrew@cat-it.co.nz
Catalyst IT Ltd, PO Box 10-225, Level 22, 105 The Terrace, Wellington
Me: +64 (21) 635 694, Fax: +64 (4) 499 5596, Office: +64 (4) 499 2267

#11Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Andrew McMillan (#10)
Re: Vacuum only with 20% old tuples

I can see some value in having a _configurable_ threshold percentage of
deletes before vacuum kicks in and attempts to shrink table/index
on-disk file sizes. This would let the end user decide, and 20% is
probably a reasonable default, but if it isn't, then changing a default
is easier to do down the track.

I can also see that it could be done with (perhaps) a modification to
VACUUM syntax, say:
VACUUM [VERBOSE] [SHRINK] ...

And I believe that the whole thing will go better if ANALYZE is taken
_out_ of vacuum, as was discussed on this list a month or two ago.

The analyze process no longer locks the table exclusively. It will be
made a separate command in 7.1, though an ANALYZE option will still be
available in VACUUM.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#12The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#2)
Re: Vacuum only with 20% old tuples

On Tue, 11 Jul 2000, Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I suggest that we change vacuum to only move/remove tuples if more than
20% of the tuples are expired.

When we do vacuum, we drop all indexes and recreate them.

This fixes the complaint about vacuum slowness when there are many
expired rows in the table. We know this is caused by excessive index
updates. It allows indexes to shrink (Jan pointed this out to me.) And
it fixes the TOAST problem with TOAST values in indexes.

We can't "drop and recreate" without a solution to the relation
versioning issue (unless you are prepared to accept a nonfunctional
database after a failure partway through index rebuild on a system
table). I think we should do this, but it's not all that simple...

I do not see what your 20% idea has to do with this, though, nor
why it's a good idea. If I've told the thing to vacuum I think
it should vacuum. 20% of a big table could be a lot of megabytes,
and I don't want some arbitrary decision in the code about whether
I can reclaim that space or not.

I wouldn't mind seeing some automagic vacuum happen *if* >20% of the tuples
are expired ... but I don't understand the limit when I tell it to vacuum
either ...

#13The Hermit Hacker
scrappy@hub.org
In reply to: Bruce Momjian (#3)
Re: Vacuum only with 20% old tuples

How about leaving vacuum as is, but extending REINDEX so that it
drops/rebuilds all indices on a TABLE | DATABASE? Or does it do that
now? From reading \h REINDEX, my thought is that it doesn't, but ...

On Tue, 11 Jul 2000, Bruce Momjian wrote:

We can't "drop and recreate" without a solution to the relation
versioning issue (unless you are prepared to accept a nonfunctional
database after a failure partway through index rebuild on a system
table). I think we should do this, but it's not all that simple...

I do not see what your 20% idea has to do with this, though, nor
why it's a good idea. If I've told the thing to vacuum I think
it should vacuum. 20% of a big table could be a lot of megabytes,
and I don't want some arbitrary decision in the code about whether
I can reclaim that space or not.

Well, I think we should do a sequential scan before starting vacuum to
find the number of expired rows.

Now that we are removing indexes, doing that just to remove a few tuples is a
major waste. The user cannot really know if the table is worth
vacuuming in normal use. They are just going to use the default. Now,
I think a FORCE option would be good, or the ability to change the 20%
default.

Remember, commercial db's don't even return unused space if you remove
all the rows in a table. At least Informix doesn't, and I am sure there
are others. I like vacuum, but let's not make it jump major hurdles for
small gain.

-- 
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#14Bruce Momjian
pgman@candle.pha.pa.us
In reply to: The Hermit Hacker (#12)
Re: Vacuum only with 20% old tuples

I do not see what your 20% idea has to do with this, though, nor
why it's a good idea. If I've told the thing to vacuum I think
it should vacuum. 20% of a big table could be a lot of megabytes,
and I don't want some arbitrary decision in the code about whether
I can reclaim that space or not.

I wouldn't mind seeing some automagic vacuum happen *if* >20% of the tuples
are expired ... but I don't understand the limit when I tell it to vacuum
either ...

I am confused by your comment.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#15The Hermit Hacker
scrappy@hub.org
In reply to: Bruce Momjian (#14)
Re: Vacuum only with 20% old tuples

On Wed, 12 Jul 2000, Bruce Momjian wrote:

I do not see what your 20% idea has to do with this, though, nor
why it's a good idea. If I've told the thing to vacuum I think
it should vacuum. 20% of a big table could be a lot of megabytes,
and I don't want some arbitrary decision in the code about whether
I can reclaim that space or not.

I wouldn't mind seeing some automagic vacuum happen *if* >20% of the tuples
are expired ... but I don't understand the limit when I tell it to vacuum
either ...

I am confused by your comment.

Make the backend reasonably intelligent ... periodically do a scan, as
you've suggested would be required for your above 20% idea, and if >20%
of the records are expired, auto-start a vacuum (settable, of course) ...

#16Bruce Momjian
pgman@candle.pha.pa.us
In reply to: The Hermit Hacker (#15)
Re: Vacuum only with 20% old tuples

On Wed, 12 Jul 2000, Bruce Momjian wrote:

I do not see what your 20% idea has to do with this, though, nor
why it's a good idea. If I've told the thing to vacuum I think
it should vacuum. 20% of a big table could be a lot of megabytes,
and I don't want some arbitrary decision in the code about whether
I can reclaim that space or not.

I wouldn't mind seeing some automagic vacuum happen *if* >20% of the tuples
are expired ... but I don't understand the limit when I tell it to vacuum
either ...

I am confused by your comment.

Make the backend reasonably intelligent ... periodically do a scan, as
you've suggested would be required for your above 20% idea, and if >20%
of the records are expired, auto-start a vacuum (settable, of course) ...

It would be good if we could vacuum without locking. We could find a
table when things are mostly idle, and vacuum it then.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#17The Hermit Hacker
scrappy@hub.org
In reply to: Bruce Momjian (#16)
Re: Vacuum only with 20% old tuples

On Wed, 12 Jul 2000, Bruce Momjian wrote:

On Wed, 12 Jul 2000, Bruce Momjian wrote:

I do not see what your 20% idea has to do with this, though, nor
why it's a good idea. If I've told the thing to vacuum I think
it should vacuum. 20% of a big table could be a lot of megabytes,
and I don't want some arbitrary decision in the code about whether
I can reclaim that space or not.

I wouldn't mind seeing some automagic vacuum happen *if* >20% of the tuples
are expired ... but I don't understand the limit when I tell it to vacuum
either ...

I am confused by your comment.

Make the backend reasonably intelligent ... periodically do a scan, as
you've suggested would be required for your above 20% idea, and if >20%
of the records are expired, auto-start a vacuum (settable, of course) ...

It would be good if we could vacuum without locking. We could find a
table when things are mostly idle, and vacuum it then.

Definitely :)

#18Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: The Hermit Hacker (#17)
AW: Vacuum only with 20% old tuples

We can't "drop and recreate" without a solution to the relation
versioning issue (unless you are prepared to accept a nonfunctional
database after a failure partway through index rebuild on a system
table). I think we should do this, but it's not all that simple...

Is this topic independent of WAL in the first place?

Sure, unless Vadim sees some clever way of using WAL to eliminate
the need for versioned relations. But as far as I've seen in the
discussions, versioned relations are independent of WAL.

WAL can solve the versioned relations problem.
Remember that a sure new step in postmaster startup will be a rollforward
of the WAL, since that will hold the only synced writes of our last txns.
Thus in this step it can also do any pending rename or delete of files.
If a rename or delete fails we bail out, since we don't want the
postmaster running under such circumstances anyway.

Andreas

#19Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: The Hermit Hacker (#13)
RE: Vacuum only with 20% old tuples

-----Original Message-----
From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
Behalf Of The Hermit Hacker

How about leaving vacuum as is, but extending REINDEX so that it
drops/rebuilds all indices on a TABLE | DATABASE? Or does it do that
now? From reading \h REINDEX, my thought is that it doesn't, but ...

As for user tables, REINDEX can do it already, i.e.
REINDEX TABLE table_name FORCE; is possible under psql.
If REINDEX fails, PostgreSQL just ignores the indexes of the table
(i.e. an index scan is never applied) and REINDEX/VACUUM would
recover the state. Yes, VACUUM already has a hidden functionality
to reindex.

As for system indexes, you must shut down the postmaster and
invoke a standalone postgres with the -P option.
REINDEX DATABASE database_name FORCE; would
reindex (shrink) all system tables of the database.
It may be possible even under the postmaster if REINDEX
never fails.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#20Jan Wieck
JanWieck@t-online.de
In reply to: Hiroshi Inoue (#19)
Re: Vacuum only with 20% old tuples

Hiroshi Inoue wrote:

-----Original Message-----
From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
Behalf Of The Hermit Hacker

How about leaving vacuum as is, but extending REINDEX so that it
drops/rebuilds all indices on a TABLE | DATABASE? Or does it do that
now? From reading \h REINDEX, my thought is that it doesn't, but ...

As for user tables, REINDEX can do it already, i.e.
REINDEX TABLE table_name FORCE; is possible under psql.
If REINDEX fails, PostgreSQL just ignores the indexes of the table
(i.e. an index scan is never applied) and REINDEX/VACUUM would
recover the state. Yes, VACUUM already has a hidden functionality
to reindex.

Sorry, but there seem to be problems with that.

pgsql=# delete from t2;
DELETE 0
pgsql=# vacuum;
VACUUM
pgsql=# reindex table t2 force;
REINDEX
pgsql=# \c
You are now connected to database pgsql as user pgsql.
pgsql=# insert into t2 select * from t1;
FATAL 1: btree: failed to add item to the page
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.

It happens too if I don't reconnect to the database between
REINDEX and INSERT. Also if I drop the connection and restart the
postmaster, so it shouldn't be due to old blocks hanging
around in the cache.

The interesting thing is that the btree index gets reset to
2 blocks. Need to dive into it...

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #

#21Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Noname (#20)
RE: Vacuum only with 20% old tuples

-----Original Message-----
From: Jan Wieck [mailto:JanWieck@t-online.de]

Hiroshi Inoue wrote:

-----Original Message-----
From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
Behalf Of The Hermit Hacker

How about leaving vacuum as is, but extending REINDEX so that it
drops/rebuilds all indices on a TABLE | DATABASE? Or does it do that
now? From reading \h REINDEX, my thought is that it doesn't, but ...

As for user tables, REINDEX can do it already, i.e.
REINDEX TABLE table_name FORCE; is possible under psql.
If REINDEX fails, PostgreSQL just ignores the indexes of the table
(i.e. an index scan is never applied) and REINDEX/VACUUM would
recover the state. Yes, VACUUM already has a hidden functionality
to reindex.

Sorry, but there seem to be problems with that.

pgsql=# delete from t2;
DELETE 0
pgsql=# vacuum;
VACUUM
pgsql=# reindex table t2 force;
REINDEX
pgsql=# \c
You are now connected to database pgsql as user pgsql.
pgsql=# insert into t2 select * from t1;
FATAL 1: btree: failed to add item to the page
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.

It happens too if I don't reconnect to the database between
REINDEX and INSERT. Also if I drop the connection and restart the
postmaster, so it shouldn't be due to old blocks hanging
around in the cache.

The interesting thing is that the btree index gets reset to
2 blocks. Need to dive into it...

Hmm, couldn't reproduce it here.
What kind of indexes does t2 have?

Anyway, that the index gets reset to 2 blocks seems reasonable because
t2 is empty.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#22Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Zeugswetter Andreas SB (#18)
RE: Vacuum only with 20% old tuples

-----Original Message-----
From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
Behalf Of Zeugswetter Andreas SB

We can't "drop and recreate" without a solution to the relation
versioning issue (unless you are prepared to accept a nonfunctional
database after a failure partway through index rebuild on a system
table). I think we should do this, but it's not all that simple...

Is this topic independent of WAL in the first place?

Sure, unless Vadim sees some clever way of using WAL to eliminate
the need for versioned relations. But as far as I've seen in the
discussions, versioned relations are independent of WAL.

WAL can solve the versioned relations problem.
Remember that a sure new step in postmaster startup will be a rollforward
of the WAL, since that will hold the only synced writes of our last txns.
Thus in this step it can also do any pending rename or delete of files.

Hmm, don't you allow DDL commands inside a transaction block?

If we allow DDL commands inside a transaction block, WAL couldn't
postpone all rename/unlink operations until the end of the transaction
without resolving the conflict of table file names.

For the following queries

begin;
drop table t;
create table t (..);
insert into t values (...);
commit;

the old table file of t must vanish (using unlink() etc.) before 'create
table t' unless the new file name is different from the old one (an OID
file name would resolve the conflict in this case).
Unlinking/renaming the table file immediately isn't a problem for the
rollforward functionality. It seems a problem for the rollback functionality.

If a rename or delete fails we bail out, since we don't want the
postmaster running under such circumstances anyway.

No, there's a significant difference between the failure of 'delete'
and that of 'rename'. We would have no consistency problem even
if 'delete' fails, and wouldn't have to stop the postmaster. But we
wouldn't be able to see the renamed relation in case of a 'rename'
failure, and an excellent(??) dba would have to recover from the
inconsistency.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#23Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Hiroshi Inoue (#22)
AW: Vacuum only with 20% old tuples

WAL can solve the versioned relations problem.
Remember that a sure new step in postmaster startup will be a rollforward
of the WAL, since that will hold the only synced writes of our last txns.
Thus in this step it can also do any pending rename or delete of files.

Hmm, don't you allow DDL commands inside a transaction block?

If we allow DDL commands inside a transaction block, WAL couldn't
postpone all rename/unlink operations until the end of the transaction
without resolving the conflict of table file names.

It does not postpone anything. WAL only logs what it does:

1. log "I now start to rename the file"
2. rename the file
3. log "rename succeeded" or abort the txn
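
A rough C sketch of that protocol (the log format and helpers here are
made up, not the actual WAL API; the point is that each record is forced
to disk before the next step):

#include <stdio.h>
#include <unistd.h>

static FILE *xlog;

/* append a record and force it to disk before proceeding */
static void xlog_write_sync(const char *rec)
{
    fprintf(xlog, "%s\n", rec);
    fflush(xlog);
    fsync(fileno(xlog));
}

static int logged_rename(const char *from, const char *to)
{
    xlog_write_sync("RENAME START");        /* 1. log the intent */

    if (rename(from, to) != 0)              /* 2. the actual rename */
    {
        xlog_write_sync("RENAME FAILED");   /* 3a. definite failure -> abort txn */
        return -1;
    }
    xlog_write_sync("RENAME SUCCESS");      /* 3b. recovery sees a definite outcome */
    return 0;
}

int main(void)
{
    if ((xlog = fopen("xlog", "a")) == NULL)
        return 1;
    return logged_rename("oldfile", "newfile") == 0 ? 0 : 1;
}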

the old table file of t must vanish (using unlink() etc.) before 'create
table t' unless the new file name is different from the old one (an OID
file name would resolve the conflict in this case).

I was basing my statement on OID filenames being a given.
I am only arguing against the extra version number in the filename.

Unlinking/renaming the table file immediately isn't a problem for the
rollforward functionality. It seems a problem for the rollback
functionality.

Only the unlink cannot be done immediately; a rename can be undone
and thus can be executed immediately.

If a rename or delete fails we bail out, since we don't want the
postmaster running under such circumstances anyway.

No, there's a significant difference between the failure of 'delete'
and that of 'rename'. We would have no consistency problem even
if 'delete' fails, and wouldn't have to stop the postmaster. But we
wouldn't be able to see the renamed relation in case of a 'rename'
failure, and an excellent(??) dba would have to recover from the
inconsistency.

The dba only fixes the underlying problem, like a filesystem mounted
readonly or wrong permissions on a directory. He then simply starts the
postmaster again, and the rollforward with the rename will then succeed.

Andreas

#24Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Zeugswetter Andreas SB (#23)
RE: Vacuum only with 20% old tuples

-----Original Message-----
From: Zeugswetter Andreas SB

WAL can solve the versioned relations problem.
Remember that a sure new step in postmaster startup will be a rollforward
of the WAL, since that will hold the only synced writes of our last txns.
Thus in this step it can also do any pending rename or delete of files.

If we allow DDL commands inside a transaction block, WAL couldn't
postpone all rename/unlink operations until the end of the transaction
without resolving the conflict of table file names.

It does not postpone anything. WAL only logs what it does:

1. log "I now start to rename the file"
2. rename the file

How do we log *rename*?
What I meant by *rename* is to replace an existing table
file with a (e.g. work temp) file. So the old table file would
vanish by renaming. How do we save the content for rollback?
Is it preferable to save the content entirely to the WAL log file?
If *rename* is possible, are OID filenames necessary in the
first place?

3. log "rename succeeded" or abort the txn

No, there's a significant difference between the failure of 'delete'
and that of 'rename'. We would have no consistency problem even
if 'delete' fails, and wouldn't have to stop the postmaster. But we
wouldn't be able to see the renamed relation in case of a 'rename'
failure, and an excellent(??) dba would have to recover from the
inconsistency.

The dba only fixes the underlying problem, like a filesystem mounted
readonly or wrong permissions on a directory. He then simply starts the
postmaster again, and the rollforward with the rename will then succeed.

Mustn't a dba be there to restart the postmaster when *rename*
fails? We don't need a dba even when *delete* fails after a commit.

Hiroshi Inoue
Inoue@tpf.co.jp

#25Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Hiroshi Inoue (#24)
AW: Vacuum only with 20% old tuples

It does not postpone anything. WAL only logs what it does:

1. log "I now start to rename the file"
2. rename the file

How do we log *rename*?

No, step 2 was meant to do the actual rename.
You log before and after the actual rename.

What I meant by *rename* is to replace an existing table
file with a (e.g. work temp) file. So the old table file would
vanish by renaming.

You need to rename the old file first, then the new file.

How do we save the content for rollback?

With the rename of the old file.
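
Roughly, in C (file names are placeholders; the commit/rollback actions
are only sketched in comments):

#include <stdio.h>

/* Replace "table" with "newfile", keeping the old data in "saved" so
 * the operation can still be undone until commit. */
static int replace_file(const char *table, const char *newfile,
                        const char *saved)
{
    if (rename(table, saved) != 0)      /* old data now lives in saved */
        return -1;                      /* nothing changed; just abort */

    if (rename(newfile, table) != 0)
    {
        rename(saved, table);           /* undo the first rename */
        return -1;
    }
    /* on commit:   unlink(saved);
     * on rollback: rename(table, newfile); rename(saved, table); */
    return 0;
}

int main(void)
{
    return replace_file("table", "table.new", "table.saved") == 0 ? 0 : 1;
}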

Is it preferable to save the content entirely to the WAL log file?

No, unless you have log space to spare but not enough data space,
which is imho unlikely (or to be considered later).

If *rename* is possible, are OID filenames necessary in the
first place?

no

3. log "rename succeeded" or abort the txn

No, there's a significant difference between the failure of 'delete'
and that of 'rename'. We would have no consistency problem even
if 'delete' fails, and wouldn't have to stop the postmaster. But we
wouldn't be able to see the renamed relation in case of a 'rename'
failure, and an excellent(??) dba would have to recover from the
inconsistency.

The dba only fixes the underlying problem, like a filesystem mounted
readonly or wrong permissions on a directory. He then simply starts the
postmaster again, and the rollforward with the rename will then succeed.

Mustn't a dba be there to restart the postmaster when *rename*
fails? We don't need a dba even when *delete* fails after a commit.

The dba needs to clean up unused file space later,
which leaves the dba with the problem of deciding which files are needed
and which are not (quite difficult with an OID filename).

But rename is only supposed to fail if something is wrong with the
filesystem or directory. There will be a lot of other problems,
like creating sort files etc., if that is the case.
If a rename fails under normal operation the current transaction is aborted,
but you are correct that there is a problem if previous renames in the same
transaction succeeded but all further renames (for rollback) fail (is this
likely?).
In that unlikely case I would bail out, let the dba fix the problem, and fix
the db state at startup rollforward (which can then do the renames and thus
abort the txn).

Andreas

#26Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Zeugswetter Andreas SB (#25)
RE: Vacuum only with 20% old tuples

-----Original Message-----
From: Zeugswetter Andreas SB [mailto:ZeugswetterA@wien.spardat.at]

It does not postpone anything. WAL only logs what it does:

1. log "I now start to rename the file"
2. rename the file

How do we log *rename*?

No, step 2 was meant to do the actual rename.
You log before and after the actual rename.

What I meant by *rename* is to replace an existing table
file with a (e.g. work temp) file. So the old table file would
vanish by renaming.

You need to rename the old file first, then the new file.

How do we save the content for rollback?

With the rename of the old file.

OK, I see. Your idea seems to be close to Vadim's *relink*.
However, it isn't clear that *relink* has no problems.
In my test case on the cygwin port, *relink* doesn't work well:
*relink* caused an error if concurrent backends had the table
being *relink*ed open.

No, there's a significant difference between the failure of 'delete'
and that of 'rename'. We would have no consistency problem even
if 'delete' fails, and wouldn't have to stop the postmaster. But we
wouldn't be able to see the renamed relation in case of a 'rename'
failure, and an excellent(??) dba would have to recover from the
inconsistency.

The dba only fixes the underlying problem, like a filesystem mounted
readonly or wrong permissions on a directory. He then simply starts the
postmaster again, and the rollforward with the rename will then succeed.

Mustn't a dba be there to restart the postmaster when *rename*
fails? We don't need a dba even when *delete* fails after a commit.

The dba needs to clean up unused file space later,
which leaves the dba with the problem of deciding which files are needed
and which are not (quite difficult with an OID filename).

Yes, I don't want a dba to deal with such an unreliable recovery.
Probably vacuum must be wise enough to remove leftover
table files before removing dead pg_class tuples.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#27Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Hiroshi Inoue (#26)
AW: Vacuum only with 20% old tuples

OK, I see. Your idea seems to be close to Vadim's *relink*.
However, it isn't clear that *relink* has no problems.
In my test case on the cygwin port, *relink* doesn't work well:
*relink* caused an error if concurrent backends had the table
being *relink*ed open.

All actions involving a rename of a file imho involve exclusive access
to the table, thus no concurrent access is allowed/possible.

Andreas

#28Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Zeugswetter Andreas SB (#27)
Re: AW: Vacuum only with 20% old tuples

Zeugswetter Andreas SB wrote:

OK, I see. Your idea seems to be close to Vadim's *relink*.
However, it isn't clear that *relink* has no problems.
In my test case on the cygwin port, *relink* doesn't work well:
*relink* caused an error if concurrent backends had the table
being *relink*ed open.

All actions involving a rename of a file imho involve exclusive access
to the table, thus no concurrent access is allowed/possible.

Once a table file has been opened, it is not closed until the end of the
backend process, except due to a shortage of file descriptors.

Even now, the cygwin port has a flaw in that we have no guarantee of
recreating a table safely after dropping it. Unfortunately, in cygwin
we can't recreate a file of the same name until all processes have
closed the file.

Regards.

Hiroshi Inoue

#29Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Hiroshi Inoue (#28)
AW: AW: Vacuum only with 20% old tuples

All actions involving a rename of a file imho involve exclusive access
to the table, thus no concurrent access is allowed/possible.

Once a table file has been opened, it is not closed until the end of the
backend process, except due to a shortage of file descriptors.

An open file only references the inode, thus the name can safely
change underneath, no?
Of course, on the next real access to a renamed file it would need to be
closed and opened anew, since the backend has the wrong file open.
Looks like it would be good to close files as soon as possible
once they no longer correspond to a valid db object.

Even now, the cygwin port has a flaw in that we have no guarantee of
recreating a table safely after dropping it. Unfortunately, in cygwin
we can't recreate a file of the same name until all processes have
closed the file.

Even if you rename the old file before creating a new one?
Or is a rename not possible when another process has the file open?

Andreas

#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB (#25)
Re: AW: Vacuum only with 20% old tuples

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

It does not postpone anything. WAL only logs what it does:

1. log "I now start to rename the file"
2. rename the file

How do we log *rename*?

No, step 2 was meant to do the actual rename.
You log before and after the actual rename.

The trouble with depending on WAL to handle undo of external operations
(ie, filesystem calls as opposed to DB record updates) is that you don't
know all the facts if you have to recover/undo the operation. Consider
the following not-too-implausible example:

begin transaction;
alter table foo rename to bar;
create table foo ...;
delete table bar;
<crash>

If we try to implement this by log entries, we will have log entries
for rename, create, delete. We scan the log and come upon the rename.
Hmm, there's a file foo and no file bar ... looks like the rename didn't
get done, so do it. Ooops.

Related nasty scenarios arise when there's a file there of the right
name but the wrong contents, created by non-database operations.
(Consider the risks involved with a table named "core", for example.)

An even more serious problem is that you can't guarantee to be able
to either repeat or undo an external operation, because there are
factors not under your control (kernel permission checks, existing
files getting in the way, etc). Sure, failures like that are not
very likely, but that'll be cold comfort to someone whose database
gets corrupted because a file rename needs to be undone and can't be.

If *rename* is possible, are OID filenames necessary in the
first place?

no

I think that OID filenames will be a far more reliable solution to
the alter-table-rename problem than anything that involves trying
to actually rename the files. There are other benefits too, like
not being tied to filesystem-based naming restrictions for tables.

regards, tom lane

#31Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Tom Lane (#30)
AW: AW: Vacuum only with 20% old tuples

The trouble with depending on WAL to handle undo of external operations
(ie, filesystem calls as opposed to DB record updates) is that you don't
know all the facts if you have to recover/undo the operation. Consider
the following not-too-implausible example:

begin transaction;
alter table foo rename to bar;
create table foo ...;
delete table bar;
<crash>

If we try to implement this by log entries, we will have log entries
for rename, create, delete.

No, you have to have two for each. One before and one after each execution.

We scan the log and come upon the rename.
Hmm, there's a file foo and no file bar ... looks like the
rename didn't get done, so do it. Ooops.

No again. You come upon "starting rename operation" and then either
no more log records (backend abort)
or
log record "rename succeeded"
or
log record "rename failed" --> transaction abort

In this scenario you can decide what to do without second guessing.

Related nasty scenarios arise when there's a file there of the right
name but the wrong contents, created by non-database operations.
(Consider the risks involved with a table named "core", for example.)

I have always been asking for an extension for different object files,
like .dat, .idx, .tmp, ..., but this problem is imho present with any
filename, be it OIDs or named files.

An even more serious problem is that you can't guarantee to be able
to either repeat or undo an external operation, because there are
factors not under your control (kernel permission checks, existing
files getting in the way, etc). Sure, failures like that are not
very likely, but that'll be cold comfort to someone whose database
gets corrupted because a file rename needs to be undone and can't be.

Imho in such an event I'd rather have my db bail out (or switch to
readonly mode). Any further modifications would give me the shivers in
such an unstable environment. The dba then fixes the environment and
restarts the postmaster.

If *rename* is possible, are OID filenames necessary in the
first place?

no

This was only an answer to the question, not a suggestion.

I think that OID filenames will be a far more reliable solution to
the alter-table-rename problem than anything that involves trying
to actually rename the files. There are other benefits too, like
not being tied to filesystem-based naming restrictions for tables.

I was not arguing against oid filenames in this thread, only against
the addition of a version number in the filename.

Andreas

#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB (#31)
Re: AW: AW: Vacuum only with 20% old tuples

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

We scan the log and come upon the rename.
Hmm, there's a file foo and no file bar ... looks like the
rename didn't get done, so do it. Ooops.

No again. You come upon "starting rename operation" and then either
no more log records (backend abort)
or
log record "rename succeeded"
or
log record "rename failed" --> transaction abort

In this scenario you can decide what to do without second guessing.

If there are no more records, then you are reduced to guessing whether
you have to undo the rename or not. If you guess wrong, you leave the
database in a corrupted state.

regards, tom lane

#33Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Tom Lane (#32)
AW: AW: AW: Vacuum only with 20% old tuples

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

We scan the log and come upon the rename.
Hmm, there's a file foo and no file bar ... looks like the
rename didn't get done, so do it. Ooops.

No again. You come upon "starting rename operation" and then either
no more log records (backend abort)
or
log record "rename succeeded"
or
log record "rename failed" --> transaction abort

In this scenario you can decide what to do without second guessing.

If there are no more records, then you are reduced to guessing whether
you have to undo the rename or not. If you guess wrong, you leave the
database in a corrupted state.

If the original filename exists, the rename failed; otherwise it succeeded.
The backends could not have created a new file of the old name
with "starting rename" being the last log record.

Andreas

#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB (#33)
Re: AW: AW: AW: Vacuum only with 20% old tuples

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

If there are no more records, then you are reduced to guessing whether
you have to undo the rename or not. If you guess wrong, you leave the
database in a corrupted state.

If the original filename exists, the rename failed; otherwise it succeeded.

That's exactly the unreliable assumption I do not want to make.

The backends could not have created a new file of the old name
with "starting rename" being the last log record.

So you're assuming that we fsync() the log after *each* item is added?
*Within* a transaction? fsync only at end of xact was the plan,
I believe.

regards, tom lane

#35Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Tom Lane (#34)
AW: AW: AW: AW: Vacuum only with 20% old tuples

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

If there are no more records, then you are reduced to guessing whether
you have to undo the rename or not. If you guess wrong, you leave the
database in a corrupted state.

If the original filename exists, the rename failed; otherwise it succeeded.

That's exactly the unreliable assumption I do not want to make.

The backends could not have created a new file of the old name
with "starting rename" being the last log record.

So you're assuming that we fsync() the log after *each* item is added?
*Within* a transaction? fsync only at end of xact was the plan,
I believe.

With a trx log we imho would not need any fsyncs anymore
(except maybe at checkpoints).
We would open the trxlog file with O_SYNC and only do a write
when it is absolutely necessary (buffer overflow, end of trx).
But yes, the rename log entries (only those) would need to be
written immediately. Is this a performance issue? I guess not.
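
A minimal illustration in C (assuming POSIX open(2) flags; not actual
backend code):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* O_APPEND: records always go at the end; O_SYNC: write() returns
     * only after the data has reached disk, so no separate fsync() */
    int fd = open("txlog", O_WRONLY | O_APPEND | O_CREAT | O_SYNC, 0600);
    if (fd < 0)
        return 1;

    const char rec[] = "RENAME START\n";
    write(fd, rec, sizeof(rec) - 1);    /* durable as soon as it returns */
    close(fd);
    return 0;
}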

Andreas

#36Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Zeugswetter Andreas SB (#29)
RE: AW: Vacuum only with 20% old tuples

-----Original Message-----
From: Zeugswetter Andreas SB [mailto:ZeugswetterA@wien.spardat.at]

All actions involving a rename of a file imho involve exclusive access
to the table, thus no concurrent access is allowed/possible.

Once a table file has been opened, it is not closed until the end of the
backend process, except due to a shortage of file descriptors.

An open file only references the inode, thus the name can safely
change underneath, no?

It seems too delicate a matter for me to answer your question correctly.
I don't know the internals of cygwin, unfortunately.

Here's a test case for Vadim's *relink* in cygwin.

[*relink* to replace a table file with a tmp file]

link(table file, tmp2 file);
fsync(tmp2 file);
unlink(table file);

doesn't fail here even when other processes have the table
file open. However, any access to the table file seems to be inhibited
while the file is open, and

link(tmp file, table file);

fails with the message: Permission denied

fsync(table file);
unlink(tmp file).

However, a simple *rename* seems to work.

rename(table file, tmp2 file);
rename(tmp file, table file);

seems to cause no error.
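
A standalone C version of the two sequences, for anyone who wants to
reproduce this ("table", "tmpfile", and "tmp2" are placeholders; the
fsync() steps from the sequence above are omitted for brevity):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* the link()/unlink() style *relink*: the final link() is the call that
 * failed with "Permission denied" here while another process had the
 * table file open */
static int relink_style(void)
{
    if (link("table", "tmp2") != 0)    return -1;  /* keep old contents */
    if (unlink("table") != 0)          return -1;
    if (link("tmpfile", "table") != 0) return -1;  /* fails on cygwin */
    return unlink("tmpfile");
}

/* the plain rename() style, which caused no error */
static int rename_style(void)
{
    if (rename("table", "tmp2") != 0)  return -1;
    return rename("tmpfile", "table");
}

int main(int argc, char **argv)
{
    int rc = (argc > 1 && strcmp(argv[1], "rename") == 0)
        ? rename_style() : relink_style();
    if (rc != 0)
        perror("replace");
    return rc == 0 ? 0 : 1;
}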

Of course, on the next real access to a renamed file it would need to be
closed and opened anew, since the backend has the wrong file open.
Looks like it would be good to close files as soon as possible
once they no longer correspond to a valid db object.

Sorry, I missed this case. Yes, this was changed to be handled
correctly before 7.0. Other backends receive relcache invalidation
messages and close the old table files. However, other backends
don't receive the message immediately. Backend processes
would never receive the message, and would keep the table file open,
while they are idle. This has been serious for the cygwin port.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB (#35)
Re: AW: AW: AW: AW: Vacuum only with 20% old tuples

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

But yes, the rename log entries (only those) would need to be
written immediately. Is this a performance issue? I guess not.

Two fsyncs per rename is probably not a performance killer, but you'd
have to also fsync log entries for file creates and deletes, which
might be a bigger deal.

regards, tom lane

#38Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Tom Lane (#37)
AW: AW: AW: AW: AW: Vacuum only with 20% old tuples

But yes, the rename log entries (only those) would need to be
written immediately. Is this a performance issue? I guess not.

Two fsyncs per rename is probably not a performance killer, but you'd
have to also fsync log entries for file creates and deletes, which
might be a bigger deal.

I am not sure, but I don't think those are necessary,
unless you want to guarantee no leftover files on backend abort.
I agree that create/delete would be critical to performance.

Andreas