Automatic free space map filling
Something came to my mind today; I'm not sure if it's feasible, but I
would like to know opinions on it.
We've seen database applications that PostgreSQL simply could not manage
because one would have to vacuum continuously. Perhaps in those
situations one could arrange it that an update (or delete) of a row
registers the space in the free space map right away, on the assumption
that by the time it is up for reuse, the transaction will likely have
committed. Naturally, this would need to be secured in some way, for
example a "maybe" bit in the FSM itself or simply checking that the
supposed free space is really free before using it, perhaps combined
with a timeout ("don't consider until 5 seconds from now").
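To make the shape of the idea a bit more concrete, here is a minimal
sketch of what such an opportunistic entry might carry; every name below
is invented for illustration and is not an actual FSM structure:

    #include "postgres.h"
    #include "utils/timestamp.h"

    /*
     * Invented sketch, not the real FSM format: an opportunistic entry
     * records space that should become free once the deleting
     * transaction commits and drops out of every snapshot.
     */
    typedef struct MaybeFreeSpace
    {
        BlockNumber blkno;       /* page that (probably) has free space */
        Size        avail;       /* bytes expected to become free */
        bool        maybe;       /* true = not yet verified as free */
        TimestampTz not_before;  /* "don't consider until 5 seconds from now" */
    } MaybeFreeSpace;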
I think with applications that have a more or less constant data volume
but update that data a lot, this could ensure constant disk space usage
(even if it's only a constant factor above the ideal usage) without any
vacuuming.
Comments?
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Peter Eisentraut <peter_e@gmx.net> writes:
We've seen database applications that PostgreSQL simply could not manage
because one would have to vacuum continuously. Perhaps in those
situations one could arrange it that an update (or delete) of a row
registers the space in the free space map right away, on the assumption
that by the time it is up for reuse, the transaction will likely have
committed.
The free-space map is not the hard part of the problem. You still have
to VACUUM --- that is, wait until the dead tuple is not only committed
dead but is certainly dead to all onlooker transactions, and then remove
its index entries as well as the tuple itself. The first part of this
makes it impossible for a transaction to be responsible for vacuuming
its own detritus.
Naturally, this would need to be secured in some way,
The FSM is only a hint anyway --- if it points someone to a page that in
reality does not have adequate free space, nothing bad happens except
for the wasted cycles to visit the page and find that out. See the loop
in RelationGetBufferForTuple().
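For readers without the source at hand, that loop behaves roughly like
this (a heavily simplified sketch of the logic in
src/backend/access/heap/hio.c; relation extension and error handling
omitted):

    Buffer      buffer;
    Page        page;
    Size        pageFreeSpace;
    BlockNumber targetBlock;

    /* Ask the FSM for a candidate page with >= len bytes free. */
    targetBlock = GetPageWithFreeSpace(relation, len);
    while (targetBlock != InvalidBlockNumber)
    {
        buffer = ReadBuffer(relation, targetBlock);
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
        page = BufferGetPage(buffer);
        pageFreeSpace = PageGetFreeSpace(page);
        if (len <= pageFreeSpace)
            return buffer;          /* the FSM hint was good */

        /* Stale hint: record the true free space, get another candidate. */
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        ReleaseBuffer(buffer);
        targetBlock = RecordAndGetPageWithFreeSpace(relation, targetBlock,
                                                    pageFreeSpace, len);
    }
    /* FSM exhausted: fall through and extend the relation instead. */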
regards, tom lane
On a fine day, Mon, 2006-02-27 at 19:20, Peter Eisentraut wrote:
Something came to my mind today; I'm not sure if it's feasible, but I
would like to know opinions on it.
We've seen database applications that PostgreSQL simply could not manage
because one would have to vacuum continuously.
What's wrong with vacuuming continuously?
I am running an application that in fact does vacuum continuously
without any ill effects. A case where things become complicated is when
you have one huge table (say 50,000,000 rows) that is updated at a
moderate rate and needs an occasional vacuum, plus a fast-update table,
which needs continuous vacuum. Due to the current implementation of
vacuum, you have to abandon continuous vacuuming during the vacuum of
the big table, but I have written and submitted to the "patches" list a
patch which allows vacuums not to block each other out. This is stalled
due to Tom's "uneasiness" about its possible hidden effects, but it
should be available from the "patches" list to anyone in distress :p
Perhaps in those
situations one could arrange it that an update (or delete) of a row
registers the space in the free space map right away, on the assumption
that by the time it is up for reuse, the transaction will likely have
committed. Naturally, this would need to be secured in some way, for
example a "maybe" bit in the FSM itself or simply checking that the
supposed free space is really free before using it, perhaps combined
with a timeout ("don't consider until 5 seconds from now").
Unfortunately transactions have no knowledge about wallclock time :(
I think with applications that have a more or less constant data volume
----------------
Hannu
Hannu Krosing wrote:
Due to the current implementation of vacuum, you have to abandon
continuous vacuuming during the vacuum of the big table, but I have
written and submitted to the "patches" list a patch which allows vacuums
not to block each other out. This is stalled due to Tom's "uneasiness"
about its possible hidden effects, but it should be available from the
"patches" list to anyone in distress :p
Do you use it in production? Have you noticed any ill effects?
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Monday, 27 February 2006 19:42, Tom Lane wrote:
The free-space map is not the hard part of the problem. You still have
to VACUUM --- that is, wait until the dead tuple is not only committed
dead but is certainly dead to all onlooker transactions, and then remove
its index entries as well as the tuple itself. The first part of this
makes it impossible for a transaction to be responsible for vacuuming
its own detritus.
I'm not sure if I made myself clear. The idea is that you fill the free-space
map early with opportunistic entries in the hope that most updates and
deletes go through "soon". That is, these entries will be invalid for a
short time but hopefully by the time another write looks at them, the entries
will have become valid. That way you don't actually have to run vacuum on
these deleted rows.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Peter Eisentraut <peter_e@gmx.net> writes:
I'm not sure if I made myself clear. The idea is that you fill the free-space
map early with opportunistic entries in the hope that most updates and
deletes go through "soon". That is, these entries will be invalid for a
short time but hopefully by the time another write looks at them, the entries
will have become valid. That way you don't actually have to run vacuum on
these deleted rows.
How does an optimistic FSM entry avoid the need to run vacuum? All that
will happen is that some backend will visit the page and not find usable
free space.
regards, tom lane
Tom Lane wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
I'm not sure if I made myself clear. The idea is that you fill the free-space
map early with opportunistic entries in the hope that most updates and
deletes go through "soon". That is, these entries will be invalid for a
short time but hopefully by the time another write looks at them, the entries
will have become valid. That way you don't actually have to run vacuum on
these deleted rows.
How does an optimistic FSM entry avoid the need to run vacuum? All that
will happen is that some backend will visit the page and not find usable
free space.
Because the index isn't removed, right? That index thing is what
usually kills us.
--
Bruce Momjian http://candle.pha.pa.us
SRA OSS, Inc. http://www.sraoss.com
+ If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote:
How does an optimistic FSM entry avoid the need to run vacuum?
It ensures that all freed tuples are already in the FSM.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Peter Eisentraut <peter_e@gmx.net> writes:
Tom Lane wrote:
How does an optimistic FSM entry avoid the need to run vacuum?
It ensures that all freed tuples are already in the FSM.
That has nothing to do with it, because the space isn't actually free
for re-use until vacuum deletes the tuple.
regards, tom lane
Tom Lane wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
Tom Lane wrote:
How does an optimistic FSM entry avoid the need to run vacuum?
It ensures that all freed tuples are already in the FSM.
That has nothing to do with it, because the space isn't actually free
for re-use until vacuum deletes the tuple.
I think the idea is a different "free space map" of sorts, whereby a
transaction that obsoletes a tuple puts its block number in that map. A
transaction that inserts a new tuple goes to the FSM. If nothing is
found, it then goes to the new map. A block returned from that map is
then scanned and any tuple that's no longer visible for anyone is
reused.
The problem with this idea is scanning the block and for each tuple
determine if it's alive. Essentially, we would be folding the "find
dead tuples and compress page" logic, which is currently in vacuum, back
to insert. IMHO this is unacceptable from a performance PoV.
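A sketch of the insertion path under that scheme, to show where the cost
would land (the two second-map helpers are invented names, not existing
APIs):

    BlockNumber targetBlock;

    /* Try the ordinary FSM first, exactly as today. */
    targetBlock = GetPageWithFreeSpace(rel, len);
    if (targetBlock == InvalidBlockNumber)
    {
        /* Fall back to the hypothetical "obsoleted tuples" map. */
        targetBlock = GetPageFromMaybeMap(rel);
        if (targetBlock != InvalidBlockNumber)
        {
            /*
             * The costly part: per-tuple visibility checks plus page
             * compaction, i.e. vacuum-like work on the insert path.
             */
            PruneAndCompactPage(rel, targetBlock);
        }
    }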
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On a fine day, Tue, 2006-02-28 at 19:47, Alvaro Herrera wrote:
Hannu Krosing wrote:
Due to the current implementation of vacuum, you have to abandon
continuous vacuuming during the vacuum of the big table, but I have
written and submitted to the "patches" list a patch which allows vacuums
not to block each other out. This is stalled due to Tom's "uneasiness"
about its possible hidden effects, but it should be available from the
"patches" list to anyone in distress :p
Do you use it in production? Have you noticed any ill effects?
No, I don't run it in production at this time, as I solved the immediate
problem by splitting the small and big tables into different databases
and having the client applications rewritten accordingly.
I did run a parallel load (queries from the log of a real database, plus
parallel vacuums on tables) for some time and saw no ill effects there.
I will likely start using it in production on some databases during the
next few months, as a new restructuring of databases brings back the case
where huge and tiny tables are in the same database.
--------------
Hannu
On Wed, Mar 01, 2006 at 12:41:01PM -0500, Tom Lane wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
Tom Lane wrote:
How does an optimistic FSM entry avoid the need to run vacuum?
It ensures that all freed tuples are already in the FSM.
That has nothing to do with it, because the space isn't actually free
for re-use until vacuum deletes the tuple.
Hmm, but couldn't such an opportunistic approach be used for another
lightweight VACUUM mode, in such a way that VACUUM could look at a
special "Hot Spot" queue which represents potential candidates for
freeing? Let's call it a 2-phase VACUUM... this would avoid a
long-running VACUUM on big tables, e.g. when tuples get updated (or
deleted) frequently. Just an idea...
Bernd
Alvaro Herrera <alvherre@commandprompt.com> writes:
Tom Lane wrote:
That has nothing to do with it, because the space isn't actually free
for re-use until vacuum deletes the tuple.
I think the idea is a different "free space map" of sorts, whereby a
transaction that obsoletes a tuple puts its block number in that map. A
transaction that inserts a new tuple goes to the FSM. If nothing is
found, it then goes to the new map. A block returned from that map is
then scanned and any tuple that's no longer visible for anyone is
reused.
I thought we had sufficiently destroyed that "reuse a tuple" meme
yesterday. You can't do that: there are too many aspects of the system
design that are predicated on the assumption that dead tuples do not
come back to life. You have to do the full vacuuming bit (index entry
removal, super-exclusive page locking, etc) before you can remove a dead
tuple.
Essentially, we would be folding the "find
dead tuples and compress page" logic, which is currently in vacuum, back
to insert. IMHO this is unacceptable from a performance PoV.
That's the other problem: it's not apparent why pushing work from vacuum
back into foreground processing is a good idea. Especially not why
retail vacuuming of individual tuples will be better than wholesale.
regards, tom lane
On Thu, Mar 02, 2006 at 01:01:21AM -0500, Tom Lane wrote:
Essentially, we would be folding the "find
dead tuples and compress page" logic, which is currently in vacuum, back
to insert. IMHO this is unacceptable from a performance PoV.
That's the other problem: it's not apparent why pushing work from vacuum
back into foreground processing is a good idea. Especially not why
retail vacuuming of individual tuples will be better than wholesale.
The problem is that even with vacuum_cost_delay, vacuum is still very
slow and problematic in situations such as a large table in a heavy
transaction environment. Anything that could help reduce the need for
'traditional' vacuuming could well be a win.
Even so, I think the most productive path to pursue at this time is a
dead-space-map/known-clean-map. Either one is almost guaranteed to
provide benefits. Once we know what good they do we can move forward
from there with further improvements.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
I thought we had sufficiently destroyed that "reuse a tuple"
meme yesterday. You can't do that: there are too many
aspects of the system design that are predicated on the
assumption that dead tuples do not come back to life. You
have to do the full vacuuming bit (index entry removal,
super-exclusive page locking, etc) before you can remove a dead tuple.
One more idea I would like to throw in.
Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead
tuple by reducing the tuple to its header info.
(If you still wanted to be able to locate index entries fast, you would
need to keep the indexed columns, but I think we agreed that there is no
real use.)
I think that would be achievable at reasonable cost (since you can avoid
one page IO) on the page of the currently active tuple (the first page
that is considered). On this page:
    if freespace available
        --> use it
    elsif freespace available after reducing all dead rows
        --> use the freespace with a new slot
    else ....
Of course this only works when we still have free slots,
but I think that might not really be an issue.
Andreas
[sorry to everyone if that mail arrives multiple times, but i had
some odd problems with my mail gateway yesterday...]
On Wed, Mar 01, 2006 at 12:41:01PM -0500, Tom Lane wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
Tom Lane wrote:
How does an optimistic FSM entry avoid the need to run vacuum?
It ensures that all freed tuples are already in the FSM.
That has nothing to do with it, because the space isn't actually free
for re-use until vacuum deletes the tuple.
But couldn't such an opportunistic approach be used for
another lightweight VACUUM mode, in such a way that VACUUM could
look at a special "Hot Spot" queue which represents potential
candidates for freeing? Let's call it a 2-phase VACUUM... this would
avoid a constant long-running VACUUM on big tables, e.g. when
tuples get updated (or deleted) frequently. Just an idea...
Bernd
On a fine day, Thu, 2006-03-02 at 09:53, Zeugswetter Andreas DCP SD wrote:
I thought we had sufficiently destroyed that "reuse a tuple"
meme yesterday. You can't do that: there are too many
aspects of the system design that are predicated on the
assumption that dead tuples do not come back to life. You
have to do the full vacuuming bit (index entry removal,
super-exclusive page locking, etc) before you can remove a dead tuple.
One more idea I would like to throw in.
Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead
tuple by reducing the tuple to its header info.
(If you still wanted to be able to locate index entries fast, you would
need to keep the indexed columns, but I think we agreed that there is no
real use.)
I don't even think you need the header; just truncate the slot to be
0-size (the next pointer is the same as this one, or make the pointer
point to an unaligned byte or something) and detect this condition when
accessing tuples. This would add one compare to all accesses to the
tuple, but I suspect that mostly it is a noop performance-wise, as all
data needed is already available in the level-1 cache.
This would decouple declaring a tuple to be dead/reuse data space and
final cleanup/free index space.
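For context, the on-page line pointer looks roughly like this (as in
src/include/storage/itemid.h around 8.1); Hannu's scheme would amount to
forcing lp_len to 0 for a dead-but-not-yet-index-cleaned slot and
testing for that on every access (the macro name is invented):

    typedef struct ItemIdData
    {
        unsigned    lp_off:15,      /* offset of tuple on page */
                    lp_flags:2,     /* state bits */
                    lp_len:15;      /* byte length of tuple */
    } ItemIdData;

    /*
     * Hypothetical test on every tuple access.  The line pointer is
     * already in the L1 cache at that point, so the extra compare
     * should be close to free, as argued above.
     */
    #define ItemIdIsTruncated(itemId)  ((itemId)->lp_len == 0)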
--------------------
Hannu
Centuries ago, Nostradamus foresaw when tgl@sss.pgh.pa.us (Tom Lane) would write:
I thought we had sufficiently destroyed that "reuse a tuple" meme
yesterday. You can't do that: there are too many aspects of the system
design that are predicated on the assumption that dead tuples do not
come back to life.
This discussion needs to come up again in October when the zombie
movies come out :-).
That's the other problem: it's not apparent why pushing work from
vacuum back into foreground processing is a good idea. Especially
not why retail vacuuming of individual tuples will be better than
wholesale.
What is unclear to me in the discussion is whether or not this is
invalidating the item on the TODO list...
-------------------
Create a bitmap of pages that need vacuuming
Instead of sequentially scanning the entire table, have the background
writer or some other process record pages that have expired rows, then
VACUUM can look at just those pages rather than the entire table. In
the event of a system crash, the bitmap would probably be
invalidated. One complexity is that index entries still have to be
vacuumed, and doing this without an index scan (by using the heap
values to find the index entry) might be slow and unreliable,
especially for user-defined index functions.
-------------------
It strikes me as a non-starter to draw vacuum work directly into the
foreground; there is a *clear* loss in that the death of the tuple
can't actually take place at that point, due to MVCC and the fact that
it is likely that other transactions will be present, keeping the
tuple from being destroyed.
But it would *seem* attractive to do what is in the TODO, above.
Alas, the user defined index functions make cleanout of indexes much
more troublesome :-(. But what's in the TODO is still "wholesale,"
albeit involving more targeted selling than the usual Kirby VACUUM
:-).
--
select 'cbbrowne' || '@' || 'gmail.com';
http://linuxdatabases.info/info/rdbms.html
Rules of the Evil Overlord #140. "I will instruct my guards when
checking a cell that appears empty to look for the chamber pot. If the
chamber pot is still there, then the prisoner has escaped and they may
enter and search for clues. If the chamber pot is not there, then
either the prisoner is perched above the lintel waiting to strike them
with it or else he decided to take it as a souvenir (in which case he
is obviously deeply disturbed and poses no threat). Either way,
there's no point in entering." <http://www.eviloverlord.com/>
On Thu, Mar 02, 2006 at 08:33:46AM -0500, Christopher Browne wrote:
What is unclear to me in the discussion is whether or not this is
invalidating the item on the TODO list...
-------------------
Create a bitmap of pages that need vacuuming
<snip>
I think this is doable, and not invalidated by anything said so far.
All this is changing is whether to scan the whole table or just the
bits changed. Unfortunately, I don't think you can avoid scanning the
indexes :(.
Note, for this purpose you don't need to keep a bit per page. The
OS I/O system will load 64k+ (8+ pages) in one go so one bit per 8
pages would be sufficient.
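For illustration, tracking one bit per eight heap pages is only a few
lines of bit arithmetic (a standalone sketch; all names are invented):

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGES_PER_BIT 8

    /* Bits, then bytes, needed to cover nblocks heap pages. */
    #define MAP_BITS(nblocks)  (((nblocks) + PAGES_PER_BIT - 1) / PAGES_PER_BIT)
    #define MAP_BYTES(nblocks) ((MAP_BITS(nblocks) + 7) / 8)

    /* Mark the 8-page group containing blkno as needing vacuum. */
    static inline void
    map_mark_dirty(uint8_t *map, uint32_t blkno)
    {
        uint32_t bit = blkno / PAGES_PER_BIT;

        map[bit / 8] |= (uint8_t) (1 << (bit % 8));
    }

    /* Does the group containing blkno need a vacuum visit? */
    static inline bool
    map_needs_vacuum(const uint8_t *map, uint32_t blkno)
    {
        uint32_t bit = blkno / PAGES_PER_BIT;

        return (map[bit / 8] & (1 << (bit % 8))) != 0;
    }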
The inverse is keep a list of pages where we know all tuples are
visible to everyone. I'm not sure if this can be done race condition
free. ISTM it would be possible to get the new Bitmap Index Scans to
avoid checking visibility straight away and wait until it has been
AND/OR'd with other bitmaps, checking visibility only at the end.
But maybe that already happens...
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
Hannu Krosing <hannu@skype.net> writes:
On a fine day, Thu, 2006-03-02 at 09:53, Zeugswetter Andreas DCP SD wrote:
Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead
tuple by reducing the tuple to its header info.
I don't even think you need the header, just truncate the slot to be
0-size
I think you must keep the header because the tuple might be part of an
update chain (cf vacuuming bugs we repaired just a few months ago).
t_ctid is potentially interesting data even in a certainly-dead tuple.
Andreas' idea is possibly doable but I am not sure that I see the point.
It does not reduce the need for vacuum nor the I/O load imposed by
vacuum. What it does do is bias the system in the direction of
allocating an unreasonably large number of tuple line pointers on a page
(ie, more than are useful when the page is fully packed with normal
tuples). Since we never reclaim such pointers, over time all the pages
in a table would tend to develop line-pointer-bloat. I don't know what
the net overhead would be, but it'd definitely impose some aggregate
inefficiency.
regards, tom lane
Bernd Helmle <mailings@oopsware.de> writes:
But couldn't such an opportunistic approach be used for
another lightweight VACUUM mode, in such a way that VACUUM could
look at a special "Hot Spot" queue which represents potential
candidates for freeing?
The proposed dirty-page bit map seems a superior solution to that.
regards, tom lane
Christopher Browne <cbbrowne@acm.org> writes:
What is unclear to me in the discussion is whether or not this is
invalidating the item on the TODO list...
No, I don't think any of this is an argument against the
dirty-page-bitmap idea. The amount of foreground effort needed to set a
dirty-page bit is minimal (maybe even zero, if we can make the bgwriter
do it, though I'm pretty suspicious of that idea because I think it
needs to be done immediately when the page is dirtied). I don't see the
dirty-page bitmap as changing the way that VACUUM works in any
fundamental respect --- it will just allow the vacuum process to skip
reading pages that certainly don't need to change.
One point that does need to be considered though is what about
anti-wraparound processing (ie, replacing old XIDs with FrozenXID before
they wrap around)? VACUUM currently is a safe way to handle that,
but if its normal mode of operation stops looking at every tuple then
we're going to have an issue there.
regards, tom lane
Christopher Browne wrote:
What is unclear to me in the discussion is whether or not this is
invalidating the item on the TODO list...
-------------------
Create a bitmap of pages that need vacuuming
Instead of sequentially scanning the entire table, have the background
writer or some other process record pages that have expired rows, then
VACUUM can look at just those pages rather than the entire table. In
the event of a system crash, the bitmap would probably be
invalidated. One complexity is that index entries still have to be
vacuumed, and doing this without an index scan (by using the heap
values to find the index entry) might be slow and unreliable,
especially for user-defined index functions.
-------------------
It strikes me as a non-starter to draw vacuum work directly into the
foreground; there is a *clear* loss in that the death of the tuple
can't actually take place at that point, due to MVCC and the fact that
it is likely that other transactions will be present, keeping the
tuple from being destroyed.
But it would *seem* attractive to do what is in the TODO, above.
Alas, the user defined index functions make cleanout of indexes much
more troublesome :-(. But what's in the TODO is still "wholesale,"
albeit involving more targeted selling than the usual Kirby VACUUM
:-).
What bothers me about the TODO item is that if we have to sequentially
scan indexes, are we really gaining much by not having to sequentially
scan the heap? If the heap is large enough to gain from a bitmap, the
index is going to be large too. Is disabling per-index cleanout for
expression indexes the answer?
The entire expression index problem is outlined in this thread:
http://archives.postgresql.org/pgsql-hackers/2006-02/msg01127.php
I don't think it is a show-stopper because if we fail to find the index
that matches the heap, we know we have a problem and can report it and
fall back to an index scan.
Anyway, as I remember, if you have a 20-gig table, a vacuum / sequential
scan is painful, but if we have to sequentially scan all the indexes, that
is probably just as painful. If we can't make headway there and we
can't cleanout indexes without an sequential index scan, I think we
should just remove the TODO item and give up on improving vacuum
performance.
For the bitmaps, index-only scans require a bit that says "all page
tuples are visible" while vacuum wants "some tuples are expired".
DELETE would clear the first bit and set the second, while INSERT would
clear just the first, and UPDATE is a mix of INSERT and DELETE, though
perhaps on different pages.
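One reading of that bookkeeping as a small sketch (the flag names are
invented):

    #include <stdint.h>

    #define PAGE_ALL_VISIBLE  0x01   /* index-only scans may skip the heap */
    #define PAGE_HAS_EXPIRED  0x02   /* vacuum should visit this page */

    /* DELETE, or the old page of an UPDATE. */
    static void
    on_delete(uint8_t *flags)
    {
        *flags &= ~PAGE_ALL_VISIBLE;    /* page no longer all-visible */
        *flags |= PAGE_HAS_EXPIRED;     /* page now holds a dead tuple */
    }

    /* INSERT, or the new page of an UPDATE. */
    static void
    on_insert(uint8_t *flags)
    {
        *flags &= ~PAGE_ALL_VISIBLE;    /* new tuple not yet visible to all */
    }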
--
Bruce Momjian http://candle.pha.pa.us
SRA OSS, Inc. http://www.sraoss.com
+ If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote:
Christopher Browne <cbbrowne@acm.org> writes:
What is unclear to me in the discussion is whether or not this is
invalidating the item on the TODO list...
No, I don't think any of this is an argument against the
dirty-page-bitmap idea. The amount of foreground effort needed to set a
dirty-page bit is minimal (maybe even zero, if we can make the bgwriter
do it, though I'm pretty suspicious of that idea because I think it
needs to be done immediately when the page is dirtied). I don't see the
dirty-page bitmap as changing the way that VACUUM works in any
fundamental respect --- it will just allow the vacuum process to skip
reading pages that certainly don't need to change.
See the email I just posted. I am questioning how big a win it is to
skip heap pages if we have to sequentially scan all indexes.
One point that does need to be considered though is what about
anti-wraparound processing (ie, replacing old XIDs with FrozenXID before
they wrap around)? VACUUM currently is a safe way to handle that,
but if its normal mode of operation stops looking at every tuple then
we're going to have an issue there.
We would need to do a sequential scan occasionally and somehow track that.
--
Bruce Momjian http://candle.pha.pa.us
SRA OSS, Inc. http://www.sraoss.com
+ If your life is a hard drive, Christ can be your backup. +
What bothers me about the TODO item is that if we have to sequentially
scan indexes, are we really gaining much by not having to sequentially
scan the heap? If the heap is large enough to gain from a bitmap, the
index is going to be large too. Is disabling per-index cleanout for
expression indexes the answer?
I guess you're saying that a full index scan should only be done when the
index is a functional one, and use index lookups for safe indexes? That
would be a huge win for most of my vacuum-problematic tables, as I don't
have any functional indexes. But I guess full index scan would still be
faster if the percentage of pages changed is more than some threshold.
On the other hand it would allow very frequent vacuuming even for huge
tables so that situation should not occur. Autovacuum thresholds could
be lowered drastically in that case...
Anyway, as I remember, if you have a 20-gig table, a vacuum / sequential
scan is painful, but if we have to sequentially scan all the indexes, that
is probably just as painful. If we can't make headway there and we
can't cleanout indexes without an sequential index scan, I think we
should just remove the TODO item and give up on improving vacuum
performance.
From my POV, there must be a way to speed up vacuums on huge tables with
a small percentage of to-be-vacuumed tuples... a 200-million-row table
with frequent updates of the _same_ record is causing me some pain right
now. I would like to have that table vacuumed as often as possible, but
right now it only works to do it once per week, because such a
long-running transaction creates load problems by preventing the
vacuuming of other tables.
Cheers,
Csaba.
Csaba Nagy wrote:
What bothers me about the TODO item is that if we have to sequentially
scan indexes, are we really gaining much by not having to sequentially
scan the heap? If the heap is large enough to gain from a bitmap, the
index is going to be large too. Is disabling per-index cleanout for
expression indexes the answer?
I guess you're saying that a full index scan should only be done when the
index is a functional one, and use index lookups for safe indexes? That
would be a huge win for most of my vacuum-problematic tables, as I don't
have any functional indexes. But I guess full index scan would still be
faster if the percentage of pages changed is more than some threshold.
On the other hand it would allow very frequent vacuuming even for huge
tables so that situation should not occur. Autovacuum thresholds could
be lowered drastically in that case...
Right. Another idea would be to remove the heap space held by expired
rows, but to keep the tid slot in place because it is pointed to by an
index. The index entry could be recycled by a later vacuum index scan,
or if an index lookup finds such an entry. Because of multiple indexes,
I don't think the tid slot can be removed except by sequential index
scans of all indexes.
There is also the concern that updating the single-page bitmap will
cause contention among multiple sessions modifying a table.
I am thinking as long as we have to sequential-scan every index, we
aren't going to improve vacuum performance dramatically.
If the bitmap adds contention, and it is only a marginal improvement, it
might not be a win.
The bitmap can be a win, but I think we have to think more boldly to
ensure it is a win.
--
Bruce Momjian http://candle.pha.pa.us
SRA OSS, Inc. http://www.sraoss.com
+ If your life is a hard drive, Christ can be your backup. +
I think you must keep the header because the tuple might be
part of an update chain (cf vacuuming bugs we repaired just a
few months ago).
t_ctid is potentially interesting data even in a certainly-dead tuple.
Yes, I'd still want to keep the full header.
Andreas' idea is possibly doable but I am not sure that I see
the point.
It does not reduce the need for vacuum nor the I/O load
imposed by vacuum. What it does do is bias the system in the
direction of allocating an unreasonably large number of tuple
line pointers on a page (ie, more than are useful when the
page is fully packed with normal tuples).
Since we never
reclaim such pointers, over time all the pages in a table
would tend to develop line-pointer-bloat. I don't know what
the net overhead would be, but it'd definitely impose some
aggregate inefficiency.
Ok, for vacuum the slot would look like any other dead row and thus be
a target for removal.
Why do we not truncate the line pointer array? Is it that vacuum (not
the "full" version) does not move rows to other pages or slots? Of
course vacuum full could do it, but I see your point.
Maybe we could impose an upper limit on the number of slots to allow,
after which the optimization is turned off.
But this starts to sound not so good :-(
Andreas
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> writes:
Why do we not truncate the line pointer array? Is it that vacuum (not
the "full" version) does not move rows to other pages or slots? Of
course vacuum full could do it, but I see your point.
We can't reassign tuple TIDs safely except in vacuum full. It's
possible that a plain vacuum could safely truncate off unused line
pointers at the end of the array, but in the absence of a forcing
function to make those pointers become unused, I'm not sure it'd help
much.
regards, tom lane
Csaba Nagy wrote:
From my POV, there must be a way to speed up vacuums on huge tables with
a small percentage of to-be-vacuumed tuples... a 200-million-row table
with frequent updates of the _same_ record is causing me some pain right
now. I would like to have that table vacuumed as often as possible, but
right now it only works to do it once per week, because such a
long-running transaction creates load problems by preventing the
vacuuming of other tables.
Are you running 8.1? If so, you can use autovacuum and set per table
thresholds (read: vacuum aggressively) and per table cost delay settings
so that the performance impact is minimal. If you have tried 8.1
autovacuum and found it unhelpful, I would be curious to find out why.
Matt
Are you running 8.1? If so, you can use autovacuum and set per table
thresholds (read: vacuum aggressively) and per table cost delay settings
so that the performance impact is minimal. If you have tried 8.1
autovacuum and found it unhelpful, I would be curious to find out why.
Yes, I'm running 8.1, and I've set up per table auto-vacuum settings :-)
And I lowered the general thresholds too. Generally autovacuum is very
useful from my POV, and in particular the per table settings are so.
But the problem I have is not the performance impact of the vacuum
itself, but the impact of the long-running transaction of vacuuming big
tables. I do have big tables which are frequently updated and small
tables which are basically queue tables, so each inserted row will be
updated a few times and then deleted. Those queue tables tend to
accumulate huge unvacuumable dead space during any long-running
transaction, and a vacuum on the big tables is such a long-running
transaction. And I have a few of them, and one in particular is very
busy (a task table; all activities go through that one).
Now when the queue tables get 1000 times the dead space compared to
their normal size, I get performance problems. So tweaking vacuum cost
delay doesn't buy me anything, as vacuum per se is not the performance
problem; its long run time on big tables is.
Cheers,
Csaba.
Csaba Nagy wrote:
Now when the queue tables get 1000 times the dead space compared to
their normal size, I get performance problems. So tweaking vacuum cost
delay doesn't buy me anything, as vacuum per se is not the performance
problem; its long run time on big tables is.
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.
This is perfectly doable, it only needs enough motivation from a
knowledgeable person.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote:
Csaba Nagy wrote:
Now when the queue tables get 1000 times the dead space compared to
their normal size, I get performance problems. So tweaking vacuum cost
delay doesn't buy me anything, as vacuum per se is not the performance
problem; its long run time on big tables is.
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.
But what about index clearing? When do you scan each index?
--
Bruce Momjian http://candle.pha.pa.us
SRA OSS, Inc. http://www.sraoss.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:
Alvaro Herrera wrote:
Csaba Nagy wrote:
Now when the queue tables get 1000 times the dead space compared to
their normal size, I get performance problems. So tweaking vacuum cost
delay doesn't buy me anything, as vacuum per se is not the performance
problem; its long run time on big tables is.
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.
But what about index clearing? When do you scan each index?
At the end of each iteration (or earlier, depending on
maintenance_work_mem). So for each iteration you would need to scan the
indexes.
Maybe we could make maintenance_work_mem be the deciding factor; after
scanning the indexes, do the release/reacquire locks cycle.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Fri, Mar 03, 2006 at 11:40:40AM -0300, Alvaro Herrera wrote:
Csaba Nagy wrote:
Now when the queue tables get 1000 times the dead space compared to
their normal size, I get performance problems. So tweaking vacuum cost
delay doesn't buy me anything, as vacuum per se is not the performance
problem; its long run time on big tables is.
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.
I think the issue is that even for that small section, you still need
to scan all the indexes to delete the tuples there. So you actually
cause more work because you have to scan the indexes for each portion
of the table rather than just at the end.
However, if this were combined with some optimistic index deletion
code where the tuple was used to find the entry directly rather than
via bulkdelete, maybe it'd be doable. More overall I/O due to the index
lookups but the transactions become shorter. I say optimistic because
if you don't find the tuple the quick way you can always queue it for a
bulkdelete later. Hopefully it will be the uncommon case.
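In outline, something like this (both helper functions are invented
names, not existing APIs):

    /*
     * For each dead heap tuple, rebuild its key from the heap row and
     * descend each index directly -- more I/O per tuple than a
     * bulkdelete, but no full index scan.
     */
    for (i = 0; i < nindexes; i++)
    {
        if (!index_delete_by_key(indexes[i], deadTupleKey, deadTid))
            remember_for_bulkdelete(indexes[i], deadTid);   /* rare, we hope */
    }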
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
Alvaro Herrera wrote:
Bruce Momjian wrote:
Alvaro Herrera wrote:
Csaba Nagy wrote:
Now when the queue tables get 1000 times the dead space compared to
their normal size, I get performance problems. So tweaking vacuum cost
delay doesn't buy me anything, as vacuum per se is not the performance
problem; its long run time on big tables is.
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.
But what about index clearing? When do you scan each index?
At the end of each iteration (or earlier, depending on
maintenance_work_mem). So for each iteration you would need to scan the
indexes.
Maybe we could make maintenance_work_mem be the deciding factor; after
scanning the indexes, do the release/reacquire locks cycle.
Eww. How expensive is scanning an index compared to the heap? Does
anyone have figures on that in terms of I/O and time?
--
Bruce Momjian http://candle.pha.pa.us
SRA OSS, Inc. http://www.sraoss.com
+ If your life is a hard drive, Christ can be your backup. +
Eww. How expensive is scanning an index compared to the heap? Does
anyone have figures on that in terms of I/O and time?
See this post for an example:
http://archives.postgresql.org/pgsql-performance/2006-02/msg00416.php
For my 200-million-row table, scanning the pk index took ~4 hours. And then
there are some more indexes...
So if the index has to be scanned completely, that's still too much.
Cheers,
Csaba.
But what about index clearing? When do you scan each index?
At the end of each iteration (or earlier, depending on
maintenance_work_mem). So for each iteration you would need
to scan the indexes.
Maybe we could make maintenance_work_mem be the deciding
factor; after scanning the indexes, do the release/reacquire
locks cycle.
But you could do the indexes first and remember how far you can
vacuum the heap later.
So you might as well do each index separately first and remember
how far you can go with the heap for each one.
Then do the heap with a special restriction that comes from what you
remembered from the indexes.
You can now separate the heap vacuum in arbitrarily large transactions,
since the indexes are already taken care of.
(You only vacuum to the point of the oldest vacuumed index.)
Andreas
Zeugswetter Andreas DCP SD wrote:
But what about index clearing? When do you scan each index?
At the end of each iteration (or earlier, depending on
maintenance_work_mem). So for each iteration you would need
to scan the indexes.
Maybe we could make maintenance_work_mem be the deciding
factor; after scanning the indexes, do the release/reacquire
locks cycle.
But you could do the indexes first and remember how far you can
vacuum the heap later.
So you might as well do each index separately first and remember
how far you can go with the heap for each one.
Then do the heap with a special restriction that comes from what you
remembered from the indexes.
You can now separate the heap vacuum in arbitrarily large transactions,
since the indexes are already taken care of.
(You only vacuum to the point of the oldest vacuumed index.)
I thought you need to know the heap tids to remove from the index, so
how do you do the index first?
--
Bruce Momjian http://candle.pha.pa.us
SRA OSS, Inc. http://www.sraoss.com
+ If your life is a hard drive, Christ can be your backup. +
Alvaro Herrera <alvherre@commandprompt.com> writes:
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.
This is perfectly doable, it only needs enough motivation from a
knowledgeable person.
Bruce and I were discussing this the other day; it'd be pretty easy to
make plain VACUUM start a fresh transaction immediately after it
finishes a scan heap/clean indexes/clean heap cycle. The infrastructure
for this (in particular, session-level locks that won't be lost by
closing the xact) is all there. You'd have to figure out how often to
start a new xact ... every cycle is probably too often, at least for
smaller maintenance_work_mem settings ... but it'd not be hard or
involve any strange changes in system semantics.
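In outline, something like the following; StartTransactionCommand() and
CommitTransactionCommand() are the real transaction entry points, while
vacuum_one_cycle() is invented here and stands for one scan heap/clean
indexes/clean heap pass:

    /* ... take a session-level lock on the relation here; it survives
     * the transaction boundaries below ... */
    for (;;)
    {
        bool done;

        StartTransactionCommand();
        done = vacuum_one_cycle(onerel);    /* one pass, bounded by
                                               maintenance_work_mem */
        CommitTransactionCommand();         /* drops our snapshot, so we
                                               stop blocking other vacuums */
        if (done)
            break;
    }
    /* ... release the session-level lock ... */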
regards, tom lane
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.
This is perfectly doable, it only needs enough motivation from a
knowledgeable person.
Bruce and I were discussing this the other day; it'd be pretty easy to
make plain VACUUM start a fresh transaction immediately after it
finishes a scan heap/clean indexes/clean heap cycle. The infrastructure
for this (in particular, session-level locks that won't be lost by
closing the xact) is all there. You'd have to figure out how often to
start a new xact ... every cycle is probably too often, at least for
smaller maintenance_work_mem settings ... but it'd not be hard or
involve any strange changes in system semantics.
Oh, reading the original posting, these are cases where
maintenance_work_mem is full and we are going to rescan the indexes
multiple times anyway for this table.
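The arithmetic behind "maintenance_work_mem is full": lazy vacuum
remembers dead tuples as 6-byte TIDs and must make one index-cleaning
pass each time the array fills, so (assuming roughly the 8.1 sizing):

    #include <stdio.h>

    int
    main(void)
    {
        long maintenance_work_mem_kb = 16 * 1024;   /* 16 MB, for example */
        long tid_size = 6;                          /* sizeof(ItemPointerData) */
        long max_dead = maintenance_work_mem_kb * 1024L / tid_size;

        /* ~2.8 million dead TIDs per index-cleaning pass at 16 MB */
        printf("%ld dead TIDs per index pass\n", max_dead);
        return 0;
    }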
--
Bruce Momjian http://candle.pha.pa.us
SRA OSS, Inc. http://www.sraoss.com
+ If your life is a hard drive, Christ can be your backup. +
Alvaro Herrera wrote:
Csaba Nagy wrote:
Now when the queue tables get 1000 times the dead space compared to
their normal size, I get performance problems. So tweaking vacuum cost
delay doesn't buy me anything, as vacuum per se is not the performance
problem; its long run time on big tables is.
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.
I got the impression that Csaba is looking more for "multiple
simultaneous vacuum" than for the partial vacuum. Not sure of the best
way to set this up, but perhaps a flag in the pg_autovacuum table that
says "vacuum this table even if there is another vacuum running" that
way you can control things and not have autovacuum firing off lots of
vacuums at the same time. Sounds to me that these frequently updated
queue tables need to be monitored closely and not ignored for a long
period of time because we are vacuuming another table. Has anyone
looked more closely at the multiple vacuum patch that was submitted to
the patches list a while ago?
Matt
Matthew T. O'Connor wrote:
Alvaro Herrera wrote:
Csaba Nagy wrote:
Now when the queue tables get 1000 times the dead space compared to
their normal size, I get performance problems. So tweaking vacuum cost
delay doesn't buy me anything, as vacuum per se is not the performance
problem; its long run time on big tables is.
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.
I got the impression that Csaba is looking more for "multiple
simultaneous vacuum" than for the partial vacuum.
So he rather needs Hannu Krosing's patch for simultaneous vacuum ...
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
I got the impression that Csaba is looking more for "multiple
simultaneous vacuum" than for the partial vacuum.
So he rather needs Hannu Krosing's patch for simultaneous vacuum ...
Well, I guess that would be a good solution to the "queue table"
problem. The problem is that I can't deploy that patch on our production
systems without being fairly sure it won't corrupt any data... and I
can't rely on non-production testing either. Basically I'm waiting to
see Tom saying it will fly :-)
Cheers,
Csaba.
Csaba Nagy wrote:
So he rather needs Hannu Krosing's patch for simultaneous vacuum ...
Well, I guess that would be a good solution to the "queue table"
problem. The problem is that I can't deploy that patch on our production
systems without being fairly sure it won't corrupt any data... and I
can't rely on non-production testing either. Basically I'm waiting to
see Tom saying it will fly :-)
That patch is a step forward if it's deemed OK by the powers that be.
However, autovacuum would still need to be taught to handle simultaneous
vacuums. I suppose that in the interim, you could disable autovacuum
for the problematic queue table and have cron issue a manual vacuum
command for that table at the required frequency.
Anyone up for working on / testing / improving Hannu's patch? I think
it's beyond my skill set.
Matt
"Matthew T. O'Connor" <matthew@zeut.net> writes:
That patch is a step forward if it's deemed OK by the powers that be.
However, autovacuum would still need to be taught to handle simultaneous
vacuums. I suppose that in the interim, you could disable autovacuum
for the problematic queue table and have cron issue a manual vacuum
command for that table at the required frequency.
I'm not sure you should think of that as an "interim" solution. I don't
really like the idea of multiple autovacuums running concurrently. ISTM
autovac is intended to be something that lurks in the background and
doesn't take up an unreasonable percentage of your system bandwidth ...
but if there's more than one of them, it's going to be mighty hard to
control the overall load penalty. Plus you have to worry about keeping
them off each others' backs, ie, not all trying to vac the same table at
once. And in a scenario like Csaba's, I think the hotspot tables are
just exactly what they'd all try to vacuum.
For small hotspot tables I think a scheduled vacuum process is just the
thing, whereas autovac is more of a free-lance thing to keep the rest of
your DB in line.
regards, tom lane
Tom Lane wrote:
"Matthew T. O'Connor" <matthew@zeut.net> writes:
That patch is a step forward if it's deemed OK by the powers that be.
However, autovacuum would still need to be taught to handle simultaneous
vacuums. I suppose that in the interim, you could disable autovacuum
for the problematic queue table and have cron issue a manual vacuum
command for that table at the required frequency.
I'm not sure you should think of that as an "interim" solution. I don't
really like the idea of multiple autovacuums running concurrently. ISTM
autovac is intended to be something that lurks in the background and
doesn't take up an unreasonable percentage of your system bandwidth ...
but if there's more than one of them, it's going to be mighty hard to
control the overall load penalty. Plus you have to worry about keeping
them off each others' backs, ie, not all trying to vac the same table at
once. And in a scenario like Csaba's, I think the hotspot tables are
just exactly what they'd all try to vacuum.
For small hotspot tables I think a scheduled vacuum process is just the
thing, whereas autovac is more of a free-lance thing to keep the rest of
your DB in line.
While I agree that, given the current state of affairs, the cron
solution is elegant, I personally want autovac to solve all of our
vacuuming needs; I really dislike the idea of requiring a cron-based
solution to solve a fairly typical problem. Besides, the cron solution
is sloppy: it blindly vacuums whether it's needed or not, resulting in a
net increase of cycles spent vacuuming.
Anyway, I don't know the best way to implement it but I wasn't thinking
of just firing off multiple autovac processes. I was envisioning
something like an autovacuum master process that launches (forks?)
VACUUM commands and has some smarts about how many processes to fire
off, or that it would only fire off simultaneous VACUUMS for tables that
have been flagged as hot spot tables.
I recognize that teaching autovac to handle simultaneous VACUUMs in a
sane way will require a quantum leap of complexity, but it still seems a
better long-term solution. I would agree that using cron makes sense if
we were seeing lots of different scenarios that we couldn't possibly
anticipate, but I don't think that is where we are.
BTW, this discussion is only relevant if we allow simultaneous vacuums.
Is this something you see as inevitable, whether or not you think Hannu's
implementation is acceptable?
Matt
On Thu, Mar 02, 2006 at 10:05:28AM -0500, Tom Lane wrote:
Hannu Krosing <hannu@skype.net> writes:
On a fine day, Thu, 2006-03-02 at 09:53, Zeugswetter Andreas DCP SD wrote:
Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead
tuple by reducing the tuple to its header info.
Andreas' idea is possibly doable but I am not sure that I see the point.
It does not reduce the need for vacuum nor the I/O load imposed by
vacuum. What it does do is bias the system in the direction of
allocating an unreasonably large number of tuple line pointers on a page
(ie, more than are useful when the page is fully packed with normal
tuples). Since we never reclaim such pointers, over time all the pages
in a table would tend to develop line-pointer-bloat. I don't know what
the net overhead would be, but it'd definitely impose some aggregate
inefficiency.
What would be involved in reclaiming item pointer space? Is there any
reason it's not done today? (I know I've been bitten once by this...)
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Thu, Mar 02, 2006 at 03:19:46PM +0100, Martijn van Oosterhout wrote:
Note, for this purpose you don't need to keep a bit per page. The
OS I/O system will load 64k+ (8+ pages) in one go so one bit per 8
pages would be sufficient.
AFAIK that's entirely dependent on the filesystem and how it's created
(and possibly the OS as well). So arbitrarily deciding each bit is 8
pages is a bad idea. I could see allowing for a setting that determines
how many pages per bit, though, but I think we're also getting ahead of
ourselves.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Fri, Mar 03, 2006 at 04:14:41PM +0100, Csaba Nagy wrote:
Eww. How expensive is scanning an index compared to the heap? Does
anyone have figures on that in terms of I/O and time?
See this post for an example:
http://archives.postgresql.org/pgsql-performance/2006-02/msg00416.php
For my 200-million-row table, scanning the pk index took ~4 hours. And then
there are some more indexes...
So if the index has to be scanned completely, that's still too much.
But how does a scan of the index compare to a scan of the table? For
example, if indexes are 1/5th the size of the table, you can
(theoretically) scan 5 indexes in the same amount of time it takes to
scan the heap. That indicates to me that even if we did have to scan all
indexes, a dirty page bitmap would still be a win over the current
situation. But it appears that it should be safe to do index lookups on
indexes that aren't expressions. And I believe that we could take steps
down the road to allow for index lookups on indexes that only used
functions that were known to be safe.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Fri, Mar 03, 2006 at 11:37:00AM -0500, Tom Lane wrote:
Bruce and I were discussing this the other day; it'd be pretty easy to
make plain VACUUM start a fresh transaction immediately after it
finishes a scan heap/clean indexes/clean heap cycle. The infrastructure
for this (in particular, session-level locks that won't be lost by
closing the xact) is all there. You'd have to figure out how often to
start a new xact ... every cycle is probably too often, at least for
smaller maintenance_work_mem settings ... but it'd not be hard or
If maintenance_work_mem is small you're likely to have poor performance
anyway; I'm skeptical that the overhead of starting a new xact would be
all that important. If you care about performance, you'll probably have
increased maintenance_work_mem anyway.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Centuries ago, Nostradamus foresaw when ZeugswetterA@spardat.at ("Zeugswetter Andreas DCP SD") would write:
But what about index clearing? When do you scan each index?
At the end of each iteration (or earlier, depending on
maintenance_work_mem). So for each iteration you would need
to scan the indexes.
Maybe we could make maintenance_work_mem be the deciding
factor; after scanning the indexes, do the release/reacquire
locks cycle.
But you could do the indexes first and remember how far you can
vacuum the heap later.
But the indexes _can't_ be done first; you _first_ need to know which
tuples are dead, which requires looking at the table itself.
--
select 'cbbrowne' || '@' || 'gmail.com';
http://linuxdatabases.info/info/languages.html
Pound for pound, the amoeba is the most vicious animal on earth.
Hi,
I am presently working on a module which extends PostgreSQL to store
audio files. While storing the audio files in the database I used large
objects; now the problem is that I am able to unlink them, but the data
is still present in the database. Can someone suggest how to delete (not
just unlink) large objects from the PostgreSQL database?
Thanking you,
Md.Abdul Aziz
On Sat, Mar 04, 2006 at 12:08:52PM +0530, Md.Abdul Aziz wrote:
I am presently working on a module which extends PostgreSQL to store
audio files. While storing the audio files in the database I used large
objects; now the problem is that I am able to unlink them, but the data
is still present in the database. Can someone suggest how to delete (not
just unlink) large objects from the PostgreSQL database?
"VACUUM FULL pg_largeobject" might be what you're looking for, but
if you're going to reload the data then an ordinary VACUUM (without
FULL) will free the space for re-use by PostgreSQL without shrinking
the file (unless the table has no live tuples, in which case the
file size will be zeroed).
--
Michael Fuhr
Jim C. Nasby wrote:
... how many pages per bit ...
Are we trying to set up a complex solution to a problem
that'll be mostly moot once partitioning is easier and
partitioned tables are common?
In many cases I can think of the bulk of the data would be in
old partitions that are practically never written to (so would
need no vacuuming and could always use index-only lookups);
while the hot parts of large tables would be on partitions
that would need frequent vacuuming and wouldn't benefit
from index-only lookups.
In these cases, 1 bit per partition would work well,
and seems a lot easier to keep track of than bits-per-page.
contrib/vacuumlo perhaps?
Michael Fuhr wrote:
On Sat, Mar 04, 2006 at 12:08:52PM +0530, Md.Abdul Aziz wrote:
I am presently working on a module which extends PostgreSQL to store
audio files. While storing the audio files in the database I used large
objects; now the problem is that I am able to unlink them, but the data
is still present in the database. Can someone suggest how to delete (not
just unlink) large objects from the PostgreSQL database?
"VACUUM FULL pg_largeobject" might be what you're looking for, but
if you're going to reload the data then an ordinary VACUUM (without
FULL) will free the space for re-use by PostgreSQL without shrinking
the file (unless the table has no live tuples, in which case the
file size will be zeroed).
On Sun, Mar 05, 2006 at 09:54:02PM +0800, Christopher Kings-Lynne wrote:
contrib/vacuumlo perhaps?
vacuumlo only calls lo_unlink(); the data still exists in pg_largeobject
due to MVCC.
--
Michael Fuhr
But you could do the indexes first and remember how far you
can vacuum
the heap later.
But the indexes _can't_ be done first; you _first_ need to
know which tuples are dead, which requires looking at the
table itself.
If we already had the "all tuples visible" bitmap I think we could
first scan the bitmap and decide whether we can afford to look
at the visibility info for each entry in the index.
We only collect the ctids before so we don't have the inefficient
lookups,
but if we can avoid the lookup in most cases it would again be
attractive.
Andreas
Zeugswetter Andreas DCP SD wrote:
But you could do the indexes first and remember how far you
can vacuum
the heap later.
But the indexes _can't_ be done first; you _first_ need to
know which tuples are dead, which requires looking at the
table itself.
If we already had the "all tuples visible" bitmap I think we could
first scan the bitmap and decide whether we can afford to look
at the visibility info for each entry in the index.
We only collect the ctids before so we don't have the inefficient
lookups,
but if we can avoid the lookup in most cases it would again be
attractive.
The problem is that index to heap lookups are very slow.
--
Bruce Momjian http://candle.pha.pa.us
SRA OSS, Inc. http://www.sraoss.com
+ If your life is a hard drive, Christ can be your backup. +
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> wrote:
Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead
tuple by reducing the tuple to its header info.
I was just working on your idea. In my work, bgwriter truncates
dead tuples and leaves only their headers. I'll send a concept patch
to PATCHES.
We must take a super-exclusive lock on pages before vacuuming them. Bgwriter
tries to take an exclusive lock before it writes a page, and vacuums only if
the lock turns out to be super-exclusive. Otherwise, it gives up and writes
normally. This is an optimistic approach, but I assume the success rate is
high because most pages written by bgwriter are least recently used (LRU).
Also, I changed bgwriter_lru_maxpages to be adjusted automatically: backends
do not vacuum, so as not to disturb main transaction processing, which means
bgwriter should write most of the dirty pages.
There is much room for discussion of this idea.
Comments are welcome.
---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> wrote:
Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead
tuple by reducing the tuple to it's header info.
The attached patch realizes the concept of his idea: dead tuples are
reduced to their headers by bgwriter.
This patch is incomplete, so please discuss in the thread on HACKERS.
---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
Attachment: bgvacuum-0309.patch.txt (application/octet-stream)
diff -cpr pgsql-orig/src/backend/access/heap/heapam.c pgsql/src/backend/access/heap/heapam.c
*** pgsql-orig/src/backend/access/heap/heapam.c 2006-03-09 10:19:20.000000000 +0900
--- pgsql/src/backend/access/heap/heapam.c 2006-03-09 10:20:50.000000000 +0900
*************** heap_restrpos(HeapScanDesc scan)
*** 2742,2755 ****
XLogRecPtr
log_heap_clean(Relation reln, Buffer buffer, OffsetNumber *unused, int uncnt)
{
xl_heap_clean xlrec;
XLogRecPtr recptr;
XLogRecData rdata[2];
! /* Caller should not call me on a temp relation */
! Assert(!reln->rd_istemp);
!
! xlrec.node = reln->rd_node;
xlrec.block = BufferGetBlockNumber(buffer);
rdata[0].data = (char *) &xlrec;
--- 2742,2761 ----
XLogRecPtr
log_heap_clean(Relation reln, Buffer buffer, OffsetNumber *unused, int uncnt)
{
+ /* Caller should not call me on a temp relation */
+ Assert(!reln->rd_istemp);
+
+ return log_heap_clean_rnode(&reln->rd_node, buffer, unused, uncnt);
+ }
+
+ XLogRecPtr
+ log_heap_clean_rnode(RelFileNode *rnode, Buffer buffer, OffsetNumber *unused, int uncnt)
+ {
xl_heap_clean xlrec;
XLogRecPtr recptr;
XLogRecData rdata[2];
! xlrec.node = *rnode;
xlrec.block = BufferGetBlockNumber(buffer);
rdata[0].data = (char *) &xlrec;
diff -cpr pgsql-orig/src/backend/access/transam/xlog.c pgsql/src/backend/access/transam/xlog.c
*** pgsql-orig/src/backend/access/transam/xlog.c 2006-03-09 10:19:20.000000000 +0900
--- pgsql/src/backend/access/transam/xlog.c 2006-03-09 10:20:50.000000000 +0900
*************** CreateCheckPoint(bool shutdown, bool for
*** 5215,5221 ****
CheckPointCLOG();
CheckPointSUBTRANS();
CheckPointMultiXact();
! FlushBufferPool();
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPoint.redo);
--- 5215,5221 ----
CheckPointCLOG();
CheckPointSUBTRANS();
CheckPointMultiXact();
! FlushBufferPool(shutdown);
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPoint.redo);
diff -cpr pgsql-orig/src/backend/commands/dbcommands.c pgsql/src/backend/commands/dbcommands.c
*** pgsql-orig/src/backend/commands/dbcommands.c 2006-03-09 10:19:20.000000000 +0900
--- pgsql/src/backend/commands/dbcommands.c 2006-03-09 10:20:50.000000000 +0900
*************** createdb(const CreatedbStmt *stmt)
*** 360,366 ****
* up-to-date for the copy. (We really only need to flush buffers for the
* source database, but bufmgr.c provides no API for that.)
*/
! BufferSync();
/*
* Once we start copying subdirectories, we need to be able to clean 'em
--- 360,366 ----
* up-to-date for the copy. (We really only need to flush buffers for the
* source database, but bufmgr.c provides no API for that.)
*/
! BufferSync(false);
/*
* Once we start copying subdirectories, we need to be able to clean 'em
*************** dbase_redo(XLogRecPtr lsn, XLogRecord *r
*** 1361,1367 ****
* up-to-date for the copy. (We really only need to flush buffers for
* the source database, but bufmgr.c provides no API for that.)
*/
! BufferSync();
/*
* Copy this subdirectory to the new location
--- 1361,1367 ----
* up-to-date for the copy. (We really only need to flush buffers for
* the source database, but bufmgr.c provides no API for that.)
*/
! BufferSync(false);
/*
* Copy this subdirectory to the new location
diff -cpr pgsql-orig/src/backend/storage/buffer/bufmgr.c pgsql/src/backend/storage/buffer/bufmgr.c
*** pgsql-orig/src/backend/storage/buffer/bufmgr.c 2006-03-09 10:19:20.000000000 +0900
--- pgsql/src/backend/storage/buffer/bufmgr.c 2006-03-09 10:22:41.000000000 +0900
***************
*** 53,58 ****
--- 53,77 ----
#include "utils/resowner.h"
#include "pgstat.h"
+ #include "access/heapam.h"
+ #include "storage/procarray.h"
+ #include "storage/freespace.h"
+ #include "utils/tqual.h"
+
+
+ /*#define BGVACUUM_DEBUG*/
+ #define BGVACUUM_STAT
+
+ #ifdef BGVACUUM_STAT
+ int BgVacuumStat_Vacuum = 0;
+ int BgVacuumStat_Useless = 0;
+ int BgVacuumStat_Conflict = 0;
+ int BgVacuumStat_Skip = 0;
+ int BgVacuumStat_Unknown = 0;
+ #define BGVACUUM_STAT_INC(name) (++(name))
+ #else /* BGVACUUM_STAT */
+ #define BGVACUUM_STAT_INC(name) ((void)0)
+ #endif /* BGVACUUM_STAT */
/* Note: these two macros only work on shared buffers, not local ones! */
#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
*************** double bgwriter_all_percent = 0.333;
*** 73,78 ****
--- 92,101 ----
int bgwriter_lru_maxpages = 5;
int bgwriter_all_maxpages = 5;
+ bool bgvacuum_autotune = true;
+ bool bgvacuum_fsm = true;
+ bool bgvacuum_relation = true;
+
long NDirectFileRead; /* some I/O's are direct file access. bypass
* bufmgr */
*************** static bool IsForInput;
*** 86,97 ****
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
static bool PinBuffer(volatile BufferDesc *buf);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf,
bool fixOwner, bool normalAccess);
! static bool SyncOneBuffer(int buf_id, bool skip_pinned);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
--- 109,131 ----
/* local state for LockBufferForCleanup */
static volatile BufferDesc *PinCountWaitBuf = NULL;
+ typedef enum SyncMode
+ {
+ SyncLru, /* by bgwriter 'lru' writes. */
+ SyncAll, /* by bgwriter 'all' writes. */
+ SyncFlush, /* by checkpoint. */
+ SyncShutdown, /* by shutdown checkpoint. */
+ } SyncMode;
static bool PinBuffer(volatile BufferDesc *buf);
static void PinBuffer_Locked(volatile BufferDesc *buf);
static void UnpinBuffer(volatile BufferDesc *buf,
bool fixOwner, bool normalAccess);
! static bool SyncOneBuffer(int buf_id, SyncMode mode, TransactionId xmin);
! static bool RelationVacuumPage(volatile BufferDesc *bufHdr,
! SyncMode mode, TransactionId xmin);
! static bool RelationDefragPage(BufferDesc *bufHdr, TransactionId xmin);
! static void RelationRecordFreeSpace(BufferDesc *bufHdr);
static void WaitIO(volatile BufferDesc *buf);
static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
*************** BufferAlloc(Relation reln,
*** 507,514 ****
oldTag = buf->tag;
oldFlags = buf->flags;
buf->tag = newTag;
! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
buf->flags |= BM_TAG_VALID;
buf->usage_count = 0;
UnlockBufHdr(buf);
--- 541,553 ----
oldTag = buf->tag;
oldFlags = buf->flags;
buf->tag = newTag;
! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR | BM_RELATION);
buf->flags |= BM_TAG_VALID;
+ if (!reln->rd_istemp && reln->rd_rel &&
+ reln->rd_rel->relkind == RELKIND_RELATION)
+ {
+ buf->flags |= BM_RELATION;
+ }
buf->usage_count = 0;
UnlockBufHdr(buf);
*************** UnpinBuffer(volatile BufferDesc *buf, bo
*** 892,907 ****
* This is called at checkpoint time to write out all dirty shared buffers.
*/
void
! BufferSync(void)
{
int buf_id;
int num_to_scan;
int absorb_counter;
/*
* Find out where to start the circular scan.
*/
! buf_id = StrategySyncStart();
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 931,954 ----
* This is called at checkpoint time to write out all dirty shared buffers.
*/
void
! BufferSync(bool shutdown)
{
int buf_id;
int num_to_scan;
int absorb_counter;
+ int numGetBuffer;
+ TransactionId xmin;
+ SyncMode mode = (shutdown ? SyncShutdown : SyncFlush);
+
+ if (bgvacuum_relation)
+ xmin = GetOldestXmin(true);
+ else
+ xmin = InvalidTransactionId;
/*
* Find out where to start the circular scan.
*/
! buf_id = StrategySyncStart(&numGetBuffer);
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
*************** BufferSync(void)
*** 913,919 ****
absorb_counter = WRITES_PER_ABSORB;
while (num_to_scan-- > 0)
{
! if (SyncOneBuffer(buf_id, false))
{
/*
* If in bgwriter, absorb pending fsync requests after each
--- 960,966 ----
absorb_counter = WRITES_PER_ABSORB;
while (num_to_scan-- > 0)
{
! if (SyncOneBuffer(buf_id, mode, xmin))
{
/*
* If in bgwriter, absorb pending fsync requests after each
*************** BgBufferSync(void)
*** 944,949 ****
--- 991,1003 ----
int buf_id2;
int num_to_scan;
int num_written;
+ int numGetBuffer;
+ TransactionId xmin;
+
+ if (bgvacuum_relation)
+ xmin = GetOldestXmin(true);
+ else
+ xmin = InvalidTransactionId;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
*************** BgBufferSync(void)
*** 975,981 ****
{
if (++buf_id1 >= NBuffers)
buf_id1 = 0;
! if (SyncOneBuffer(buf_id1, false))
{
if (++num_written >= bgwriter_all_maxpages)
break;
--- 1029,1035 ----
{
if (++buf_id1 >= NBuffers)
buf_id1 = 0;
! if (SyncOneBuffer(buf_id1, SyncAll, xmin))
{
if (++num_written >= bgwriter_all_maxpages)
break;
*************** BgBufferSync(void)
*** 987,1002 ****
* This loop considers only unpinned buffers close to the clock sweep
* point.
*/
! if (bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0)
{
num_to_scan = (int) ((NBuffers * bgwriter_lru_percent + 99) / 100);
num_written = 0;
! buf_id2 = StrategySyncStart();
while (num_to_scan-- > 0)
{
! if (SyncOneBuffer(buf_id2, true))
{
if (++num_written >= bgwriter_lru_maxpages)
break;
--- 1041,1066 ----
* This loop considers only unpinned buffers close to the clock sweep
* point.
*/
! if (bgwriter_lru_percent > 0.0)
{
num_to_scan = (int) ((NBuffers * bgwriter_lru_percent + 99) / 100);
num_written = 0;
! buf_id2 = StrategySyncStart(&numGetBuffer);
! if (bgvacuum_autotune)
! {
! if (bgwriter_lru_maxpages < numGetBuffer)
! bgwriter_lru_maxpages += 1;
! else if(bgwriter_lru_maxpages > numGetBuffer)
! bgwriter_lru_maxpages -= 1;
!
! /* starts writing a little ahead. */
! buf_id2 = (buf_id2 + bgwriter_lru_maxpages) % NBuffers;
! }
while (num_to_scan-- > 0)
{
! if (SyncOneBuffer(buf_id2, SyncLru, xmin))
{
if (++num_written >= bgwriter_lru_maxpages)
break;
*************** BgBufferSync(void)
*** 1020,1027 ****
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static bool
! SyncOneBuffer(int buf_id, bool skip_pinned)
{
volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
/*
--- 1084,1093 ----
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static bool
! SyncOneBuffer(int buf_id, SyncMode mode, TransactionId xmin)
{
+ BufFlags flags;
+ bool noref;
volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
/*
*************** SyncOneBuffer(int buf_id, bool skip_pinn
*** 1034,1049 ****
* upcoming changes and so we are not required to write such dirty buffer.
*/
LockBufHdr(bufHdr);
! if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY))
{
UnlockBufHdr(bufHdr);
return false;
}
! if (skip_pinned &&
! (bufHdr->refcount != 0 || bufHdr->usage_count != 0))
{
UnlockBufHdr(bufHdr);
! return false;
}
/*
--- 1100,1118 ----
* upcoming changes and so we are not required to write such dirty buffer.
*/
LockBufHdr(bufHdr);
!
! noref = (bufHdr->refcount == 0);
! if (mode == SyncLru && (!noref || bufHdr->usage_count != 0))
{
UnlockBufHdr(bufHdr);
return false;
}
!
! flags = bufHdr->flags;
! if (!(flags & BM_VALID) || !(flags & BM_DIRTY))
{
UnlockBufHdr(bufHdr);
! return mode == SyncLru;
}
/*
*************** SyncOneBuffer(int buf_id, bool skip_pinn
*** 1051,1057 ****
* buffer is clean by the time we've locked it.)
*/
PinBuffer_Locked(bufHdr);
! LWLockAcquire(bufHdr->content_lock, LW_SHARED);
FlushBuffer(bufHdr, NULL);
--- 1120,1145 ----
* buffer is clean by the time we've locked it.)
*/
PinBuffer_Locked(bufHdr);
!
! if (flags & BM_RELATION)
! {
! if (bgvacuum_relation && mode != SyncShutdown && noref)
! {
! if (!RelationVacuumPage(bufHdr, mode, xmin))
! return false;
! }
! else
! {
! BGVACUUM_STAT_INC( BgVacuumStat_Skip );
! LWLockAcquire(bufHdr->content_lock, LW_SHARED);
! }
! }
! else
! {
! /* non-relation page */
! BGVACUUM_STAT_INC( BgVacuumStat_Unknown );
! LWLockAcquire(bufHdr->content_lock, LW_SHARED);
! }
FlushBuffer(bufHdr, NULL);
*************** PrintBufferLeakWarning(Buffer buffer)
*** 1216,1221 ****
--- 1304,1328 ----
buf->refcount, loccount);
}
+ #ifdef BGVACUUM_STAT
+ static void
+ PrintBgVacuumStat(const char* name)
+ {
+ elog(LOG, "BGVACUUM(%s) :\t%d\t%d\t%d\t%d\t%d",
+ name,
+ BgVacuumStat_Vacuum,
+ BgVacuumStat_Useless,
+ BgVacuumStat_Conflict,
+ BgVacuumStat_Skip,
+ BgVacuumStat_Unknown);
+ BgVacuumStat_Vacuum = 0;
+ BgVacuumStat_Useless = 0;
+ BgVacuumStat_Conflict = 0;
+ BgVacuumStat_Skip = 0;
+ BgVacuumStat_Unknown = 0;
+ }
+ #endif
+
/*
* FlushBufferPool
*
*************** PrintBufferLeakWarning(Buffer buffer)
*** 1224,1232 ****
* flushed.
*/
void
! FlushBufferPool(void)
{
! BufferSync();
smgrsync();
}
--- 1331,1349 ----
* flushed.
*/
void
! FlushBufferPool(bool shutdown)
{
! #ifdef BGVACUUM_STAT
! elog(LOG, "AUTOLRU :\t%d", bgwriter_lru_maxpages);
! PrintBgVacuumStat("background");
! #endif
!
! BufferSync(shutdown);
!
! #ifdef BGVACUUM_STAT
! PrintBgVacuumStat("checkpoint");
! #endif
!
smgrsync();
}
*************** buffer_write_error_callback(void *arg)
*** 2118,2120 ****
--- 2235,2405 ----
bufHdr->tag.rnode.dbNode,
bufHdr->tag.rnode.relNode);
}
+
+ /*
+ * RelationVacuumPage -- vacuum one relation page
+ *
+ * @param bufHdr Must be pinned.
+ * @param mode Sync mode.
+ * @param xmin Vacuum threshold.
+ * @result Should write? If true, bufHdr is shared locked.
+ */
+ static bool
+ RelationVacuumPage(volatile BufferDesc *bufHdr,
+ SyncMode mode, TransactionId xmin)
+ {
+ BufferDesc *buf;
+ bool hasSuperLock;
+
+ /* We can vacuum the page only if the super exclusive lock is held. */
+
+ if (!LWLockConditionalAcquire(bufHdr->content_lock, LW_EXCLUSIVE))
+ {
+ BGVACUUM_STAT_INC( BgVacuumStat_Conflict );
+ LWLockAcquire(bufHdr->content_lock, LW_SHARED);
+ return true;
+ }
+
+ LockBufHdr(bufHdr);
+ hasSuperLock = (bufHdr->refcount == 1); /* pinned only by me */
+ UnlockBufHdr(bufHdr);
+
+ if (!hasSuperLock)
+ {
+ LWLockDowngrade(bufHdr->content_lock);
+ BGVACUUM_STAT_INC( BgVacuumStat_Conflict );
+ return true;
+ }
+
+ buf = /* volatile_cast */ (BufferDesc *) bufHdr;
+
+ if (!RelationDefragPage(buf, xmin))
+ {
+ /* no dead tuples */
+ LWLockDowngrade(bufHdr->content_lock);
+ BGVACUUM_STAT_INC( BgVacuumStat_Useless );
+ return true;
+ }
+
+ /* vacuum was performed */
+ BGVACUUM_STAT_INC( BgVacuumStat_Vacuum );
+ switch (mode)
+ {
+ case SyncLru:
+ case SyncAll:
+ /* record free space and skip this write */
+ LWLockRelease(bufHdr->content_lock);
+ RelationRecordFreeSpace(buf); /* XXX: should call after release pin? */
+ UnpinBuffer(bufHdr, true, true /* prevent reuse for a while */);
+ return false;
+ case SyncFlush:
+ /* record free space and flush */
+ LWLockDowngrade(bufHdr->content_lock);
+ RelationRecordFreeSpace(buf); /* XXX: should call after release lock and pin? */
+ return true;
+ default:
+ elog(ERROR, "unexpected SyncMode %d", mode);
+ return false; /* keep compiler quiet */
+ }
+ }
+
+ /*
+ * RelationDefragPage -- truncate dead tuples and defragment one relation page
+ *
+ * @result Vacuum performed?
+ */
+ static bool
+ RelationDefragPage(BufferDesc *bufHdr, TransactionId xmin)
+ {
+ Buffer buffer = BufferDescriptorGetBuffer(bufHdr);
+ Page page = (Page) BufHdrGetBlock(bufHdr);
+ OffsetNumber offnum, maxoff;
+ int num_kill = 0;
+ #ifdef BGVACUUM_DEBUG
+ int num_alive = 0;
+ int num_dead = 0;
+ #endif
+ OffsetNumber unused[MaxOffsetNumber];
+ int uncnt;
+ XLogRecPtr recptr;
+
+ if (PageIsNew(page) || PageIsEmpty(page))
+ return false;
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ HeapTupleHeader tuple;
+
+ if (!ItemIdIsUsed(itemid))
+ continue;
+
+ tuple = (HeapTupleHeader) PageGetItem(page, itemid);
+ if (HeapTupleHeaderGetXmax(tuple) == FrozenTransactionId)
+ { /* This tuple has been truncated already. */
+ #ifdef BGVACUUM_DEBUG
+ ++num_dead;
+ #endif
+ continue;
+ }
+ switch (HeapTupleSatisfiesVacuum(tuple, xmin, buffer))
+ {
+ /*
+ * We can truncate recently dead tuples, because ctid chains in the headers
+ * still remain.
+ */
+ case HEAPTUPLE_DEAD:
+ case HEAPTUPLE_RECENTLY_DEAD:
+ Assert(!(tuple->t_infomask & HEAP_IS_LOCKED));
+ tuple->t_natts = 0;
+ tuple->t_infomask &=
+ ~(HEAP_HASNULL | HEAP_HASVARWIDTH | HEAP_HASEXTENDED | HEAP_HASOID);
+ HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
+ HeapTupleHeaderSetXmax(tuple, FrozenTransactionId);
+ itemid->lp_len = MAXALIGN(offsetof(HeapTupleHeaderData, t_bits));
+ ++num_kill;
+ break;
+ case HEAPTUPLE_LIVE:
+ case HEAPTUPLE_INSERT_IN_PROGRESS:
+ case HEAPTUPLE_DELETE_IN_PROGRESS:
+ #ifdef BGVACUUM_DEBUG
+ ++num_alive;
+ #endif
+ break;
+ default:
+ elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+ }
+ }
+
+ #ifdef BGVACUUM_DEBUG
+ elog(LOG, "BGVACUUM TABLE: ID=%d, alive=%d, kill=%d, dead=%d",
+ buffer, num_alive, num_kill, num_dead);
+ #endif
+
+ if (num_kill <= 0)
+ return false;
+
+ /* TODO: XLOG stuff for truncation. */
+
+ uncnt = PageRepairFragmentation(page, unused);
+
+ /* XLOG stuff for defragmentation. */
+ recptr = log_heap_clean_rnode(&bufHdr->tag.rnode, buffer, unused, uncnt);
+ PageSetLSN(page, recptr);
+ PageSetTLI(page, ThisTimeLineID);
+
+ return true;
+ }
+
+ static void
+ RelationRecordFreeSpace(BufferDesc *bufHdr)
+ {
+ if (bgvacuum_fsm)
+ {
+ Size space = PageGetFreeSpace((Page) BufHdrGetBlock(bufHdr));
+ RecordFreeSpace(&bufHdr->tag.rnode, bufHdr->tag.blockNum, space);
+ }
+ }
diff -cpr pgsql-orig/src/backend/storage/buffer/freelist.c pgsql/src/backend/storage/buffer/freelist.c
*** pgsql-orig/src/backend/storage/buffer/freelist.c 2006-03-09 10:19:20.000000000 +0900
--- pgsql/src/backend/storage/buffer/freelist.c 2006-03-09 10:20:50.000000000 +0900
*************** typedef struct
*** 27,32 ****
--- 27,34 ----
/* Clock sweep hand: index of next buffer to consider grabbing */
int nextVictimBuffer;
+ int numGetBuffer; /* Count of buffer requests */
+
int firstFreeBuffer; /* Head of list of unused buffers */
int lastFreeBuffer; /* Tail of list of unused buffers */
*************** StrategyGetBuffer(void)
*** 63,68 ****
--- 65,72 ----
LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ StrategyControl->numGetBuffer++;
+
/*
* Try to get a buffer from the freelist. Note that the freeNext fields
* are considered to be protected by the BufFreelistLock not the
*************** StrategyFreeBuffer(volatile BufferDesc *
*** 176,182 ****
* BufferSync() will proceed circularly around the buffer array from there.
*/
int
! StrategySyncStart(void)
{
int result;
--- 180,186 ----
* BufferSync() will proceed circularly around the buffer array from there.
*/
int
! StrategySyncStart(int *numGetBuffer)
{
int result;
*************** StrategySyncStart(void)
*** 186,191 ****
--- 190,197 ----
*/
LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
result = StrategyControl->nextVictimBuffer;
+ *numGetBuffer = StrategyControl->numGetBuffer;
+ StrategyControl->numGetBuffer = 0;
LWLockRelease(BufFreelistLock);
return result;
}
*************** StrategyInitialize(bool init)
*** 263,268 ****
--- 269,276 ----
/* Initialize the clock sweep pointer */
StrategyControl->nextVictimBuffer = 0;
+
+ StrategyControl->numGetBuffer = 0;
}
else
Assert(!init);
diff -cpr pgsql-orig/src/backend/storage/freespace/freespace.c pgsql/src/backend/storage/freespace/freespace.c
*** pgsql-orig/src/backend/storage/freespace/freespace.c 2006-03-09 10:19:20.000000000 +0900
--- pgsql/src/backend/storage/freespace/freespace.c 2006-03-09 10:20:50.000000000 +0900
***************
*** 70,75 ****
--- 70,76 ----
#include "storage/lwlock.h"
#include "storage/shmem.h"
+ extern bool bgvacuum_fsm;
/*----------
* During database shutdown, we store the contents of FSM into a disk file,
*************** RecordAndGetPageWithFreeSpace(RelFileNod
*** 339,344 ****
--- 340,399 ----
}
/*
+ * RecordFreeSpace - update info about a page.
+ */
+ void
+ RecordFreeSpace(RelFileNode *rel,
+ BlockNumber page,
+ Size avail)
+ {
+ FSMRelation *fsmrel;
+
+ LWLockAcquire(FreeSpaceLock, LW_EXCLUSIVE);
+
+ fsmrel = create_fsm_rel(rel);
+ if (fsmrel->storedPages > 0)
+ {
+ /*
+ elog(LOG, "RecordFreeSpace(ID=%d, stored=%d, avail=%d/%d)",
+ rel->relNode, fsmrel->storedPages, avail, fsmrel->avgRequest);
+ */
+ if (avail >= fsmrel->avgRequest)
+ fsm_record_free_space(fsmrel, page, avail);
+ }
+ else
+ {
+ int i;
+ int nPages;
+ int curAlloc;
+ int curAllocPages;
+ FSMPageData *newLocation;
+
+ nPages = MaxFSMPages / MaxFSMRelations;
+ curAlloc = realloc_fsm_rel(fsmrel, nPages, false);
+ curAllocPages = curAlloc * CHUNKPAGES;
+
+ elog(LOG, "RecordFreeSpace(NEW: ID=%d, stored=%d, avail=%d)",
+ rel->relNode, curAllocPages, avail);
+
+ newLocation = (FSMPageData *)
+ (FreeSpaceMap->arena + fsmrel->firstChunk * CHUNKBYTES);
+
+ FSMPageSetPageNum(newLocation, page);
+ FSMPageSetSpace(newLocation, avail);
+ newLocation++;
+ for (i = 1; i < curAllocPages; i++)
+ {
+ FSMPageSetPageNum(newLocation, InvalidBlockNumber);
+ FSMPageSetSpace(newLocation, 0);
+ newLocation++;
+ }
+ fsmrel->storedPages = curAllocPages;
+ }
+ LWLockRelease(FreeSpaceLock);
+ }
+
+ /*
* GetAvgFSMRequestSize - get average FSM request size for a relation.
*
* If the relation is not known to FSM, return a default value.
*************** RecordRelationFreeSpace(RelFileNode *rel
*** 420,426 ****
FSMPageSetSpace(newLocation, avail);
newLocation++;
}
! fsmrel->storedPages = nPages;
}
else
{
--- 475,487 ----
FSMPageSetSpace(newLocation, avail);
newLocation++;
}
! for (; i < curAllocPages; i++)
! {
! FSMPageSetPageNum(newLocation, InvalidBlockNumber);
! FSMPageSetSpace(newLocation, 0);
! newLocation++;
! }
! fsmrel->storedPages = curAllocPages;
}
else
{
*************** fsm_record_free_space(FSMRelation *fsmre
*** 1218,1223 ****
--- 1279,1345 ----
info += pageIndex;
FSMPageSetSpace(info, spaceAvail);
}
+ else if (bgvacuum_fsm && spaceAvail >= fsmrel->avgRequest)
+ {
+ int i, begin, end, victim = -1;
+ Size minAvail = spaceAvail;
+ FSMPageData *info;
+ int victim_search_range = 10;
+
+ info = (FSMPageData *)
+ (FreeSpaceMap->arena + fsmrel->firstChunk * CHUNKBYTES);
+
+ begin = Max(0, pageIndex - victim_search_range);
+ end = Min(pageIndex + victim_search_range, fsmrel->storedPages);
+
+ for (i = begin; i < end; i++)
+ {
+ Size avail = FSMPageGetSpace(info + i);
+ if (avail < minAvail)
+ {
+ victim = i;
+ minAvail = avail;
+ }
+ }
+ if (victim >= 0)
+ {
+ #ifdef FREESPACE_DEBUG
+ BlockNumber victimPage = FSMPageGetPageNum(info+victim);
+ #endif
+ if (victim < pageIndex)
+ {
+ memmove(info + victim, info + victim + 1, (pageIndex - victim) * sizeof(FSMPageData));
+ info += pageIndex - 1;
+ }
+ else
+ {
+ memmove(info + pageIndex + 1, info + pageIndex, (victim - pageIndex) * sizeof(FSMPageData));
+ info += pageIndex;
+ }
+ FSMPageSetPageNum(info, page);
+ FSMPageSetSpace(info, spaceAvail);
+
+ #ifdef FREESPACE_DEBUG
+ {
+ FSMPageData *p = (FSMPageData *)
+ (FreeSpaceMap->arena + fsmrel->firstChunk * CHUNKBYTES);
+ for (i = 0; i < fsmrel->storedPages-1; i++)
+ {
+ if(FSMPageGetPageNum(p+i) > FSMPageGetPageNum(p+i+1))
+ {
+ elog(WARNING, "INSERTED page=%u, index=%u", page, pageIndex);
+ elog(WARNING, "VICTIM page=%u, index=%u, avail=%u", victimPage, victim, minAvail);
+ for (i = 0; i < fsmrel->storedPages; i++)
+ {
+ elog(WARNING, "[%5d] %5u = %5d", i, FSMPageGetPageNum(p+i), FSMPageGetSpace(p+i));
+ }
+ elog(FATAL, "FSM corrupted (not sorted)");
+ }
+ }
+ }
+ #endif
+ }
+ }
else
{
/*
diff -cpr pgsql-orig/src/backend/storage/lmgr/lwlock.c pgsql/src/backend/storage/lmgr/lwlock.c
*** pgsql-orig/src/backend/storage/lmgr/lwlock.c 2006-03-09 10:19:20.000000000 +0900
--- pgsql/src/backend/storage/lmgr/lwlock.c 2006-03-09 10:20:50.000000000 +0900
*************** LWLockHeldByMe(LWLockId lockid)
*** 604,606 ****
--- 604,662 ----
}
return false;
}
+
+ /*
+ * LWLockDowngrade - downgrade exclusive lock to shared lock
+ */
+ void
+ LWLockDowngrade(LWLockId lockid)
+ {
+ volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ PGPROC *head;
+ PGPROC *proc;
+
+ /* TODO: Check lockid is exclusive locked by me */
+ PRINT_LWDEBUG("LWLockDowngrade", lockid, lock);
+
+ /* Acquire mutex. Time spent holding mutex should be short! */
+ SpinLockAcquire(&lock->mutex);
+
+ Assert (lock->exclusive > 0);
+ lock->exclusive--;
+ lock->shared++;
+
+ head = lock->head;
+ if (head != NULL)
+ {
+ if (lock->releaseOK && !head->lwExclusive)
+ {
+ proc = head;
+ while (proc->lwWaitLink != NULL &&
+ !proc->lwWaitLink->lwExclusive)
+ proc = proc->lwWaitLink;
+ lock->head = proc->lwWaitLink;
+ proc->lwWaitLink = NULL;
+ lock->releaseOK = false;
+ }
+ else
+ {
+ /* lock is still held, can't awaken anything */
+ head = NULL;
+ }
+ }
+
+ SpinLockRelease(&lock->mutex);
+
+ /*
+ * Awaken any waiters I removed from the queue.
+ */
+ while (head != NULL)
+ {
+ LOG_LWDEBUG("LWLockDowngrade", lockid, "release waiter");
+ proc = head;
+ head = proc->lwWaitLink;
+ proc->lwWaitLink = NULL;
+ proc->lwWaiting = false;
+ PGSemaphoreUnlock(&proc->sem);
+ }
+ }
diff -cpr pgsql-orig/src/backend/utils/misc/guc.c pgsql/src/backend/utils/misc/guc.c
*** pgsql-orig/src/backend/utils/misc/guc.c 2006-03-09 10:19:20.000000000 +0900
--- pgsql/src/backend/utils/misc/guc.c 2006-03-09 10:20:50.000000000 +0900
*************** static struct config_bool ConfigureNames
*** 1003,1008 ****
--- 1003,1035 ----
false, NULL, NULL
},
+ {
+ {"bgvacuum_autotune", PGC_SIGHUP, RESOURCES,
+ gettext_noop("Enables the bgwriter's auto tuning of bgwriter_lru_maxpages."),
+ NULL
+ },
+ &bgvacuum_autotune,
+ true, NULL, NULL
+ },
+
+ {
+ {"bgvacuum_fsm", PGC_SIGHUP, RESOURCES,
+ gettext_noop("Enables the auto maintenance of free space map."),
+ NULL
+ },
+ &bgvacuum_fsm,
+ true, NULL, NULL
+ },
+
+ {
+ {"bgvacuum_relation", PGC_SIGHUP, RESOURCES,
+ gettext_noop("Enables the bgwriter's vacuum of relations."),
+ NULL
+ },
+ &bgvacuum_relation,
+ true, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL
diff -cpr pgsql-orig/src/include/access/heapam.h pgsql/src/include/access/heapam.h
*** pgsql-orig/src/include/access/heapam.h 2006-03-09 10:19:20.000000000 +0900
--- pgsql/src/include/access/heapam.h 2006-03-09 10:20:50.000000000 +0900
*************** extern void heap_redo(XLogRecPtr lsn, XL
*** 181,186 ****
--- 181,188 ----
extern void heap_desc(char *buf, uint8 xl_info, char *rec);
extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *unused, int uncnt);
+ extern XLogRecPtr log_heap_clean_rnode(RelFileNode *rnode, Buffer buffer,
+ OffsetNumber *unused, int uncnt);
extern XLogRecPtr log_heap_move(Relation reln, Buffer oldbuf,
ItemPointerData from,
Buffer newbuf, HeapTuple newtup);
diff -cpr pgsql-orig/src/include/storage/buf_internals.h pgsql/src/include/storage/buf_internals.h
*** pgsql-orig/src/include/storage/buf_internals.h 2006-03-09 10:19:20.000000000 +0900
--- pgsql/src/include/storage/buf_internals.h 2006-03-09 10:20:50.000000000 +0900
***************
*** 35,40 ****
--- 35,41 ----
#define BM_IO_ERROR (1 << 4) /* previous I/O failed */
#define BM_JUST_DIRTIED (1 << 5) /* dirtied since write started */
#define BM_PIN_COUNT_WAITER (1 << 6) /* have waiter for sole pin */
+ #define BM_RELATION (1 << 7) /* relation page */
typedef bits16 BufFlags;
*************** extern long int LocalBufferFlushCount;
*** 175,181 ****
/* freelist.c */
extern volatile BufferDesc *StrategyGetBuffer(void);
extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
! extern int StrategySyncStart(void);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
--- 176,182 ----
/* freelist.c */
extern volatile BufferDesc *StrategyGetBuffer(void);
extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
! extern int StrategySyncStart(int *numGetBuffer);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
diff -cpr pgsql-orig/src/include/storage/bufmgr.h pgsql/src/include/storage/bufmgr.h
*** pgsql-orig/src/include/storage/bufmgr.h 2006-03-09 10:19:20.000000000 +0900
--- pgsql/src/include/storage/bufmgr.h 2006-03-09 10:20:50.000000000 +0900
*************** extern double bgwriter_all_percent;
*** 32,37 ****
--- 32,41 ----
extern int bgwriter_lru_maxpages;
extern int bgwriter_all_maxpages;
+ extern bool bgvacuum_autotune;
+ extern bool bgvacuum_fsm;
+ extern bool bgvacuum_relation;
+
/* in buf_init.c */
extern DLLIMPORT char *BufferBlocks;
extern DLLIMPORT int32 *PrivateRefCount;
*************** extern char *ShowBufferUsage(void);
*** 128,134 ****
extern void ResetBufferUsage(void);
extern void AtEOXact_Buffers(bool isCommit);
extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(void);
extern BlockNumber BufferGetBlockNumber(Buffer buffer);
extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
--- 132,138 ----
extern void ResetBufferUsage(void);
extern void AtEOXact_Buffers(bool isCommit);
extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(bool shutdown);
extern BlockNumber BufferGetBlockNumber(Buffer buffer);
extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
*************** extern void LockBufferForCleanup(Buffer
*** 153,159 ****
extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
! extern void BufferSync(void);
extern void BgBufferSync(void);
extern void AtProcExit_LocalBuffers(void);
--- 157,163 ----
extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
! extern void BufferSync(bool shutdown);
extern void BgBufferSync(void);
extern void AtProcExit_LocalBuffers(void);
diff -cpr pgsql-orig/src/include/storage/freespace.h pgsql/src/include/storage/freespace.h
*** pgsql-orig/src/include/storage/freespace.h 2006-03-09 10:19:20.000000000 +0900
--- pgsql/src/include/storage/freespace.h 2006-03-09 10:20:50.000000000 +0900
*************** extern BlockNumber RecordAndGetPageWithF
*** 140,145 ****
--- 140,148 ----
BlockNumber oldPage,
Size oldSpaceAvail,
Size spaceNeeded);
+ extern void RecordFreeSpace(RelFileNode *rel,
+ BlockNumber page,
+ Size avail);
extern Size GetAvgFSMRequestSize(RelFileNode *rel);
extern void RecordRelationFreeSpace(RelFileNode *rel,
int nPages,
diff -cpr pgsql-orig/src/include/storage/lwlock.h pgsql/src/include/storage/lwlock.h
*** pgsql-orig/src/include/storage/lwlock.h 2006-03-09 10:19:20.000000000 +0900
--- pgsql/src/include/storage/lwlock.h 2006-03-09 10:20:50.000000000 +0900
*************** extern bool LWLockConditionalAcquire(LWL
*** 70,75 ****
--- 70,76 ----
extern void LWLockRelease(LWLockId lockid);
extern void LWLockReleaseAll(void);
extern bool LWLockHeldByMe(LWLockId lockid);
+ extern void LWLockDowngrade(LWLockId lockid);
extern int NumLWLocks(void);
extern Size LWLockShmemSize(void);
On Thu, 2006-03-09 at 15:53 +0900, ITAGAKI Takahiro wrote:
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> wrote:
Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead
tuple by reducing the tuple to its header info.
The attached patch realizes the concept of his idea: dead tuples are
reduced to their headers by bgwriter.
This patch is incomplete, so please discuss in the thread on HACKERS.
I'm interested in this patch but you need to say more about it. I get
the general idea but it would be useful if you could give a full
description of what this patch is trying to do and why.
Thanks,
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> wrote:
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> wrote:
Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead
tuple by reducing the tuple to its header info.
The attached patch realizes the concept of his idea: dead tuples are
reduced to their headers by bgwriter.
I'm interested in this patch but you need to say more about it. I get
the general idea but it would be useful if you could give a full
description of what this patch is trying to do and why.
OK, I'll try to explain the patch. Excuse the long write-up.
* Purpose
The basic idea is just "reducing the dead tuple to its header info",
suggested by Andreas. This is a lightweight per-page sweeping to reduce
the consumption of free space map and the necessity of VACUUM; i.e,
normal VACUUM is still needed occasionally.
I think it is useful on heavy-update workloads. It showed a 5-10%
performance improvement on DBT-2 after 9 hours of running *without* vacuum.
I don't know whether it is still effective with well-scheduled vacuum.
* Why does bgwriter do vacuum?
Sweeping has a cost, so a non-backend process should do it. Also, the pages
worth vacuuming are almost always dirty, because their tuples have just been
updated or deleted. Bgwriter handles dirty pages, so I think it is a good
place for sweeping.
* Locking
We must take a super-exclusive lock on the pages before vacuuming. In the
patch, bgwriter tries to take an exclusive lock before it writes a page, and
vacuums only if the lock turns out to be super-exclusive. Otherwise, it gives
up and writes the page normally. This is an optimistic approach, but I assume
the success rate is high because most pages written by bgwriter are least
recently used (LRU).
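In code, the check condenses to roughly this (abridged from
RelationVacuumPage in the patch upthread; not compilable on its own):

    /* Try for the exclusive lock without blocking. */
    if (!LWLockConditionalAcquire(bufHdr->content_lock, LW_EXCLUSIVE))
    {
        /* contended: fall back to an ordinary shared-lock write */
        LWLockAcquire(bufHdr->content_lock, LW_SHARED);
        return true;
    }
    /* the lock is super-exclusive only if ours is the sole pin */
    LockBufHdr(bufHdr);
    hasSuperLock = (bufHdr->refcount == 1);
    UnlockBufHdr(bufHdr);
    if (!hasSuperLock)
    {
        LWLockDowngrade(bufHdr->content_lock);  /* give up the sweep */
        return true;
    }
    /* safe to sweep the page here */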
* Keep the headers
We cannot remove dead tuples completely in a per-page sweep, because
references to the tuples from indexes still remain. We might keep only
line pointers (4 bytes), but that might lead to line-pointer-bloat problems
(http://archives.postgresql.org/pgsql-hackers/2006-03/msg00116.php),
so the headers (4+32 bytes) should be left.
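Concretely, the truncation keeps the line pointer and tuple header and
drops everything else (abridged from RelationDefragPage in the patch):

    /* reduce a DEAD or RECENTLY_DEAD tuple to its bare header */
    tuple->t_natts = 0;
    tuple->t_infomask &=
        ~(HEAP_HASNULL | HEAP_HASVARWIDTH | HEAP_HASEXTENDED | HEAP_HASOID);
    HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
    HeapTupleHeaderSetXmax(tuple, FrozenTransactionId); /* marks it truncated */
    itemid->lp_len = MAXALIGN(offsetof(HeapTupleHeaderData, t_bits));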
* Other twists and GUC variables in the patch
- Bgwriter cannot access the catalogs, so I added BM_RELATION hint bit
to BufferDesc. Only relation pages will be swept. This is enabled by
GUC variable 'bgvacuum_relation'.
- I changed bgwriter_lru_maxpages to be adjusted automatically. Backends
do not vacuum, so as not to disturb their processing, so bgwriter should
write most of the dirty pages. ('bgvacuum_autotune')
- After sweeping, the page is added to the free space map. I made a simple
replacement algorithm for the free space map that replaces the nearby page
with the least free space. ('bgvacuum_fsm')
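For anyone trying the patch, the new knobs go in postgresql.conf (names
and defaults as defined in the patch's guc.c changes; all three default
to on and take effect on SIGHUP):

    bgvacuum_relation = on    # let bgwriter sweep relation pages
    bgvacuum_autotune = on    # auto-adjust bgwriter_lru_maxpages
    bgvacuum_fsm = on         # record reclaimed space in the FSM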
* Issues
- If WAL is produced by sweeping a page, writing the page should be deferred
for a while, because the WAL must be flushed before the page is written.
- Bgwriter writes pages in 4 contexts: background writes for LRU and ALL,
checkpoint, and shutdown. In the current patch, pages are swept in the 3
contexts other than shutdown, but it may be better to sweep only on LRU.
* Related discussions
- Real-Time Vacuum Possibility (Rod Taylor)
http://archives.postgresql.org/pgsql-hackers/2005-03/msg00518.php
| have the bgwriter take a look at the pages it has, and see if it can do
| any vacuum work based on pages it is about to send to disk
- Pre-allocated free space for row updating (like PCTFREE) (Satoshi Nagayasu)
http://archives.postgresql.org/pgsql-hackers/2005-08/msg01135.php
| light-weight repairing on a single page is needed to maintain free space
- Dead Space Map (Heikki Linnakangas)
http://archives.postgresql.org/pgsql-hackers/2006-02/msg01125.php
| vacuuming pages one by one as they're written by bgwriter
Thank you for reading to the end.
I'd like to hear your comments.
---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
On Mon, 2006-03-13 at 17:38 +0900, ITAGAKI Takahiro wrote:
Simon Riggs <simon@2ndquadrant.com> wrote:
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> wrote:
Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead
tuple by reducing the tuple to its header info.
The attached patch realizes the concept of his idea: dead tuples are
reduced to their headers by bgwriter.
I'm interested in this patch but you need to say more about it. I get
the general idea but it would be useful if you could give a full
description of what this patch is trying to do and why.
OK, I'll try to explain the patch. Excuse the long write-up.
OK. I'll take a look at this, thanks.
Best Regards, Simon Riggs
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.
This is perfectly doable, it only needs enough motivation from a
knowledgeable person.
Bruce and I were discussing this the other day; it'd be pretty easy to
make plain VACUUM start a fresh transaction immediately after it
finishes a scan heap/clean indexes/clean heap cycle. The infrastructure
for this (in particular, session-level locks that won't be lost by
closing the xact) is all there. You'd have to figure out how often to
start a new xact ... every cycle is probably too often, at least for
smaller maintenance_work_mem settings ... but it'd not be hard or
involve any strange changes in system semantics.
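In outline, each cycle boundary would do something like this (a sketch
using the usual xact.c entry points; the policy test is hypothetical):

    /* at the end of one scan heap/clean indexes/clean heap cycle */
    if (time_for_new_xact())            /* hypothetical policy check */
    {
        CommitTransactionCommand();     /* xact-level resources released... */
        StartTransactionCommand();      /* ...but session-level locks survive */
    }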
Should this be a TODO? One item of discussion was that people should
just increase their maintenance_work_mem so the job can be done faster in larger
batches.
--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Fri, Apr 28, 2006 at 03:58:16PM -0400, Bruce Momjian wrote:
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.This is perfectly doable, it only needs enough motivation from a
knowledgeable person.Bruce and I were discussing this the other day; it'd be pretty easy to
make plain VACUUM start a fresh transaction immediately after it
finishes a scan heap/clean indexes/clean heap cycle. The infrastructure
for this (in particular, session-level locks that won't be lost by
closing the xact) is all there. You'd have to figure out how often to
start a new xact ... every cycle is probably too often, at least for
smaller maintenance_work_mem settings ... but it'd not be hard or
involve any strange changes in system semantics.
Should this be a TODO? One item of discussion was that people should
just increase their maintenance_work_mem so the job can be done faster in larger
batches.
Except that wouldn't help when vacuuming a lot of small tables; each one
would get its own transaction.
ISTM that tying this directly to maintenance_work_mem is a bit
confusing, since the idea is to keep vacuum transaction duration down so
that it isn't causing dead tuples to build up itself. It seems like it
would be better to have vacuum start a fresh transaction after a certain
number of tuples have died. But since there's no way to actually measure
that without having row level stats turned on, maybe number of
transactions or length of time would be good surrogates.
Since it sounds like we'd want the transaction to start only at the
start of a clean cycle it could just check the limits at the start of
each cycle. That would prevent it from wrapping the vacuum of each small
table with a (rather pointless) new transaction.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
Alvaro Herrera <alvherre@commandprompt.com> writes:
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.
Bruce and I were discussing this the other day; it'd be pretty easy to
make plain VACUUM start a fresh transaction immediately after it
finishes a scan heap/clean indexes/clean heap cycle.
Except that wouldn't help when vacuuming a lot of small tables; each one
would get its own transaction.
What's your point? There's only a problem for big tables, and VACUUM
already does use a new transaction for each table.
regards, tom lane
On Mon, May 01, 2006 at 01:19:30PM -0500, Jim C. Nasby wrote:
ISTM that tying this directly to maintenance_work_mem is a bit
confusing, since the idea is to keep vacuum transaction duration down so
that it isn't causing dead tuples to build up itself. It seems like it
would be better to have vacuum start a fresh transaction after a certain
number of tuples have died. But since there's no way to actually measure
that without having row level stats turned on, maybe number of
transactions or length of time would be good surrogates.
AIUI, vacuum starts a fresh cycle because it's accumulated a certain
number of dead tuples to clean up. Isn't that what you're asking for?
maintenance_work_mem is the limit on the amount of deleted tuple
information that can be stored (amongst other things I'm sure)...
Since it sounds like we'd want the transaction to start only at the
start of a clean cycle it could just check the limits at the start of
each cycle. That would prevent it from wrapping the vacuum of each small
table with a (rather pointless) new transaction.
Every table has to be in its own transaction since that's the duration
of the locks. Vacuum handling multiple tables in one transaction leaves
you open to deadlocks.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
On 5/1/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
On Mon, May 01, 2006 at 01:19:30PM -0500, Jim C. Nasby wrote:
ISTM that tying this directly to maintenance_work_mem is a bit
confusing, since the idea is to keep vacuum transaction duration down so
that it isn't causing dead tuples to build up itself. It seems like it
would be better to have vacuum start a fresh transaction after a certain
number of tuples have died. But since there's no way to actually measure
that without having row level stats turned on, maybe number of
transactions or length of time would be good surrogates.
AIUI, vacuum starts a fresh cycle because it's accumulated a certain
number of dead tuples to clean up. Isn't that what you're asking for?
maintenance_work_mem is the limit on the amount of deleted tuple
information that can be stored (amongst other things I'm sure)...
Hmm, one idea, which may (or may not) be interesting for large
table vacuum is allowing a syntax similar to:
VACUUM table WHERE some_col > now()-'1 hour'::interval;
I.e. Let vacuum run "piggyback" on some index. This would allow
for a quick vacuum of a fraction of a large table. Especially when
the table is large, and only some data (new data) are being modified.
The vacuum for such a table would:
1. scan the index according to the WHERE criteria and create a bitmap
of blocks to look at.
2. go through these blocks and vacuum them.
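A sketch of that control flow, with entirely hypothetical helpers (none
of these functions exist today):

    BlockNumber blkno;
    BlockSet   *blocks;     /* hypothetical bitmap-of-blocks type */

    /* 1. scan the index with the WHERE qual, remembering heap blocks */
    blocks = collect_blocks_from_index(rel, idx, qual);    /* hypothetical */

    /* 2. visit only those blocks and vacuum them */
    while ((blkno = blockset_next(blocks)) != InvalidBlockNumber)
        vacuum_one_block(rel, blkno);                      /* hypothetical */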
Hmm, another perhaps silly idea -- a special index kind for tracking
tuple deaths. I.e. -- something like: whenever a tuple is updated/deleted,
insert an entry into such an index, using the last session the tuple is
visible for as a key. Then, perhaps, vacuum could scan such an index and
find tuples which are candidates for removal. I lack the knowledge of
PostgreSQL's internals, so forgive me if I am writing something
completely insane. :)
Regards,
Dawid
On Fri, 2006-04-28 at 15:58 -0400, Bruce Momjian wrote:
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.This is perfectly doable, it only needs enough motivation from a
knowledgeable person.Bruce and I were discussing this the other day; it'd be pretty easy to
make plain VACUUM start a fresh transaction immediately after it
finishes a scan heap/clean indexes/clean heap cycle. The infrastructure
for this (in particular, session-level locks that won't be lost by
closing the xact) is all there. You'd have to figure out how often to
start a new xact ... every cycle is probably too often, at least for
smaller maintenance_work_mem settings ... but it'd not be hard or
involve any strange changes in system semantics.
Should this be a TODO? One item of discussion was that people should
just increase their maintenance_work_mem so the job can be done faster in larger
batches.
Yes, I think it should be a todo item.
Csaba's point was that it was the duration a VACUUM transaction was held
open that caused problems. Increasing maintenance_work_mem won't help
with that problem.
This would then allow a VACUUM to progress with a high vacuum_cost_delay
without any ill effects elsewhere in the system.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Mon, May 01, 2006 at 10:24:50PM +0200, Dawid Kuroczko wrote:
VACUUM table WHERE some_col > now()-'1 hour'::interval;
I.e. Let vacuum run "piggyback" on some index. This would allow
for a quick vacuum of a fraction of a large table. Especially when
the table is large, and only some data (new data) are being modified.
The vacuum for such a table would:
1. scan the index according to the WHERE criteria and create a bitmap
of blocks to look at.
2. go through these blocks and vacuum them.
Hmm, another perhaps silly idea -- a special index kind for tracking
tuple deaths. I.e. -- something like: whenever a tuple is updated/deleted,
insert an entry into such an index, using the last session the tuple is
visible for as a key. Then, perhaps, vacuum could scan such an index and
find tuples which are candidates for removal. I lack the knowledge of
PostgreSQL's internals, so forgive me if I am writing something
completely insane. :)
There is a TODO to create a 'dead space map' which would cover #2 and
probably eliminate any use for #1.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Fri, 2006-03-03 at 11:37, Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
So for you it would certainly help a lot to be able to vacuum the first
X pages of the big table, stop, release locks, create new transaction,
continue with the next X pages, lather, rinse, repeat.
This is perfectly doable, it only needs enough motivation from a
knowledgeable person.
Bruce and I were discussing this the other day; it'd be pretty easy to
make plain VACUUM start a fresh transaction immediately after it
finishes a scan heap/clean indexes/clean heap cycle.
Do you mean the full (scan heap/clean indexes/clean heap) cycle or some
smaller cycles inside each step?
If you mean the full cycle, then it is probably not worth it, as even a
single 'clean index' pass can take hours on larger tables.
The infrastructure
for this (in particular, session-level locks that won't be lost by
closing the xact) is all there. You'd have to figure out how often to
start a new xact ... every cycle is probably too often, at least for
smaller maintenance_work_mem settings ... but it'd not be hard or
involve any strange changes in system semantics.
-----------
Hannu
Hannu Krosing <hannu@skype.net> writes:
If you mean the full cycle, then it is probably not worth it, as even a
single 'clean index' pass can take hours on larger tables.
The patch Heikki is working on will probably alleviate that problem,
because it will allow vacuum to scan the indexes in physical rather than
logical order.
regards, tom lane
Added to TODO list with URL.
---------------------------------------------------------------------------
ITAGAKI Takahiro wrote:
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> wrote:
Ok, we cannot reuse a dead tuple. Maybe we can reuse the space of a dead
tuple by reducing the tuple to its header info.
The attached patch realizes the concept of his idea: dead tuples are
reduced to their headers by bgwriter.
This patch is incomplete, so please discuss in the thread on HACKERS.
---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
[ Attachment, skipping... ]
--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +