Plans for solving the VACUUM problem
I have been thinking about the problem of VACUUM and how we might fix it
for 7.2. Vadim has suggested that we should attack this by implementing
an overwriting storage manager and transaction UNDO, but I'm not totally
comfortable with that approach: it seems to me that it's an awfully large
change in the way Postgres works. Instead, here is a sketch of an attack
that I think fits better into the existing system structure.
First point: I don't think we need to get rid of VACUUM, exactly. What
we want for 24x7 operation is to be able to do whatever housekeeping we
need without locking out normal transaction processing for long intervals.
We could live with routine VACUUMs if they could run in parallel with
reads and writes of the table being vacuumed. They don't even have to run
in parallel with schema updates of the target table (CREATE/DROP INDEX,
ALTER TABLE, etc). Schema updates aren't things you do lightly for big
tables anyhow. So what we want is more of a "background VACUUM" than a
"no VACUUM" solution.
Second: if VACUUM can run in the background, then there's no reason not
to run it fairly frequently. In fact, it could become an automatically
scheduled activity like CHECKPOINT is now, or perhaps even a continuously
running daemon (which was the original conception of it at Berkeley, BTW).
This is important because it means that VACUUM doesn't have to be perfect.
The existing VACUUM code goes to huge lengths to ensure that it compacts
the table as much as possible. We don't need that; if we miss some free
space this time around, but we can expect to get it the next time (or
eventually), we can be happy. This leads to thinking of space management
in terms of steady-state behavior, rather than the periodic "big bang"
approach that VACUUM represents now.
But having said that, there's no reason to remove the existing VACUUM
code: we can keep it around for situations where you need to crunch a
table as much as possible and you can afford to lock the table while
you do it. The new code would be a new command, maybe "VACUUM LAZY"
(or some other name entirely).
Enough handwaving, what about specifics?
1. Forget moving tuples from one page to another. Doing that in a
transaction-safe way is hugely expensive and complicated. Lazy VACUUM
will only delete dead tuples and coalesce the free space thus made
available within each page of a relation.
2. This does no good unless there's a provision to re-use that free space.
To do that, I propose a free space map (FSM) kept in shared memory, which
will tell backends which pages of a relation have free space. Only if the
FSM shows no free space available will the relation be extended to insert
a new or updated tuple.
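The lookup described in point 2 can be sketched roughly like this (Python used purely for illustration; the class, relation names, page numbers, and sizes are all invented, not PostgreSQL code):

```python
# Illustrative sketch only: a toy free space map keyed by relation, not
# PostgreSQL's actual data structures.  Names and sizes are made up.

class FreeSpaceMap:
    def __init__(self):
        # relation id -> {page number: free bytes}
        self.relations = {}

    def record(self, rel, page, free_bytes):
        self.relations.setdefault(rel, {})[page] = free_bytes

    def find_page(self, rel, needed):
        """Return a page with enough free space, or None (caller extends)."""
        for page, free in self.relations.get(rel, {}).items():
            if free >= needed:
                return page
        return None

fsm = FreeSpaceMap()
fsm.record("accounts", 3, 400)
target = fsm.find_page("accounts", 120)    # enough room on page 3
extend = fsm.find_page("accounts", 5000)   # None: must extend the relation
```

Only when `find_page` comes up empty does the backend fall back to appending a new page at the end of the file, as the text describes.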
3. Lazy VACUUM processes a table in five stages:
A. Scan relation looking for dead tuples; accumulate a list of their
TIDs, as well as info about existing free space. (This pass is
completely read-only and so incurs no WAL traffic.)
B. Remove index entries for the dead tuples. (See below for details.)
C. Physically delete dead tuples and compact free space on their pages.
D. Truncate any completely-empty pages at relation's end. (Optional,
see below.)
E. Create/update FSM entry for the table.
Note that this is crash-safe as long as the individual update operations
are atomic (which can be guaranteed by WAL entries for them). If a tuple
is dead, we care not whether its index entries are still around or not;
so there's no risk to logical consistency.
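The five stages can be modeled in miniature (all names and structures here are invented for illustration; a heap "page" is just a list of (tid, is_dead) pairs and PAGE_CAPACITY is an arbitrary tuples-per-page figure):

```python
# Toy model of the five lazy-VACUUM stages; not PostgreSQL code.

PAGE_CAPACITY = 4

def lazy_vacuum(pages, index):
    # Stage A: read-only scan collecting dead TIDs (no WAL traffic)
    dead = {tid for page in pages for tid, is_dead in page if is_dead}
    # Stage B: remove index entries for the dead tuples
    index -= dead
    # Stage C: physically delete dead tuples, compacting each page
    for i, page in enumerate(pages):
        pages[i] = [(tid, d) for tid, d in page if not d]
    # Stage D: truncate completely-empty pages at the end (optional)
    while pages and not pages[-1]:
        pages.pop()
    # Stage E: publish per-page free space for the FSM
    return {i: PAGE_CAPACITY - len(p) for i, p in enumerate(pages)}

pages = [[(1, False), (2, True)], [(3, True), (4, True)]]
index = {1, 2, 3, 4}
fsm_info = lazy_vacuum(pages, index)
```

Note the ordering matters for crash safety: index entries go away in B before the heap tuples they point at are removed in C, so an index entry never dangles at a reused slot.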
4. Observe that lazy VACUUM need not really be a transaction at all, since
there's nothing it does that needs to be cancelled or undone if it is
aborted. This means that its WAL entries do not have to hang around past
the next checkpoint, which solves the huge-WAL-space-usage problem that
people have noticed while VACUUMing large tables under 7.1.
5. Also note that there's nothing saying that lazy VACUUM must do the
entire table in one go; once it's accumulated a big enough batch of dead
tuples, it can proceed through steps B,C,D,E even though it hasn't scanned
the whole table. This avoids a rather nasty problem that VACUUM has
always had with running out of memory on huge tables.
Free space map details
----------------------
I envision the FSM as a shared hash table keyed by table ID, with each
entry containing a list of page numbers and free space in each such page.
The FSM is empty at system startup and is filled by lazy VACUUM as it
processes each table. Backends then decrement/remove page entries as they
use free space.
Critical point: the FSM is only a hint and does not have to be perfectly
accurate. It can omit space that's actually available without harm, and
if it claims there's more space available on a page than there actually
is, we haven't lost much except a wasted ReadBuffer cycle. This allows
us to take shortcuts in maintaining it. In particular, we can constrain
the FSM to a prespecified size, which is critical for keeping it in shared
memory. We just discard entries (pages or whole relations) as necessary
to keep it under budget. Obviously, we'd not bother to make entries in
the first place for pages with only a little free space. Relation entries
might be discarded on a least-recently-used basis.
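A toy illustration of the bounded, hint-only behavior (the budget, the 100-byte threshold, and the eviction policy are all assumptions; the text suggests LRU for relation entries, while this sketch simply evicts the smallest page entry):

```python
# Sketch of a size-capped, hint-only FSM; not PostgreSQL code.

MAX_ENTRIES = 4
MIN_INTERESTING = 100   # don't record pages with only a little free space

class BoundedFSM:
    def __init__(self):
        self.entries = {}          # (rel, page) -> free bytes

    def record(self, rel, page, free):
        if free < MIN_INTERESTING:
            return                 # not worth an entry in the first place
        self.entries[(rel, page)] = free
        if len(self.entries) > MAX_ENTRIES:
            # discard the least useful entry to stay under budget;
            # losing it is harmless, since the FSM is only a hint
            victim = min(self.entries, key=self.entries.get)
            del self.entries[victim]

fsm = BoundedFSM()
fsm.record("t", 1, 50)             # ignored: below threshold
for page, free in [(2, 300), (3, 800), (4, 500), (5, 900), (6, 700)]:
    fsm.record("t", page, free)
```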
Accesses to the FSM could create contention problems if we're not careful.
I think this can be dealt with by having each backend remember (in its
relcache entry for a table) the page number of the last page it chose from
the FSM to insert into. That backend will keep inserting new tuples into
that same page, without touching the FSM, as long as there's room there.
Only then does it go back to the FSM, update or remove that page entry,
and choose another page to start inserting on. This reduces the access
load on the FSM from once per tuple to once per page. (Moreover, we can
arrange that successive backends consulting the FSM pick different pages
if possible. Then, concurrent inserts will tend to go to different pages,
reducing contention for shared buffers; yet any single backend does
sequential inserts in one page, so that a bulk load doesn't cause
disk traffic scattered all over the table.)
The FSM can also cache the overall relation size, saving an lseek kernel
call whenever we do have to extend the relation for lack of internal free
space. This will help pay for the locking cost of accessing the FSM.
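The once-per-page access pattern can be demonstrated with a toy model (class names and the 8k page size are assumptions; `calls` counts trips to the shared map):

```python
# Sketch of the per-backend cached insertion page; not PostgreSQL code.

class CountingFSM:
    def __init__(self):
        self.calls = 0
        self.next_page = 0

    def get_page(self, needed):
        self.calls += 1            # one locked access to shared memory
        self.next_page += 1
        return self.next_page, 8192

class Backend:
    def __init__(self, fsm):
        self.fsm = fsm
        self.page = None           # remembered in the relcache entry
        self.free = 0

    def insert(self, size):
        if self.free < size:
            # only now do we go back to the shared FSM
            self.page, self.free = self.fsm.get_page(size)
        self.free -= size
        return self.page

fsm = CountingFSM()
backend = Backend(fsm)
for _ in range(100):
    backend.insert(400)            # 20 such tuples fit per 8192-byte page
```

One hundred inserts touch the shared map only five times: once per page filled, not once per tuple.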
Locking issues
--------------
We will need two extensions to the lock manager:
1. A new lock type that allows concurrent reads and writes
(AccessShareLock, RowShareLock, RowExclusiveLock) but not anything else.
Lazy VACUUM will grab this type of table lock to ensure the table schema
doesn't change under it. Call it a VacuumLock until we think of a better
name.
2. A "conditional lock" operation that acquires a lock if available, but
doesn't block if not.
The conditional lock will be used by lazy VACUUM to try to upgrade its
VacuumLock to an AccessExclusiveLock at step D (truncate table). If it's
able to get exclusive lock, it's safe to truncate any unused end pages.
Without exclusive lock, it's not, since there might be concurrent
transactions scanning or inserting into the empty pages. We do not want
lazy VACUUM to block waiting to do this, since if it does that it will
create a lockout situation (reader/writer transactions will stack up
behind it in the lock queue while everyone waits for the existing
reader/writer transactions to finish). Better to not do the truncation.
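The control flow for the conditional upgrade might look like this (Python's non-blocking `Lock.acquire` stands in for the proposed lock-manager call; everything else is invented):

```python
# Sketch of step D's conditional lock upgrade; not the real lock manager.

import threading

exclusive = threading.Lock()       # stands in for AccessExclusiveLock

def maybe_truncate(pages):
    if not exclusive.acquire(blocking=False):
        # someone else holds a conflicting lock: skip the truncation
        # rather than queue up and block readers/writers behind us
        return False
    try:
        while pages and not pages[-1]:
            pages.pop()            # safe: we hold exclusive access
        return True
    finally:
        exclusive.release()

pages = [[("tuple",)], [], []]
did_truncate = maybe_truncate(pages)
```

If the conditional acquire fails, the free end pages simply stay in the file; the steady-state view says we'll get another chance next time.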
Another place where lazy VACUUM may be unable to do its job completely
is in compaction of space on individual disk pages. It can physically
move tuples to perform compaction only if there are not currently any
other backends with pointers into that page (which can be tested by
looking to see if the buffer reference count is one). Again, we punt
and leave the space to be compacted next time if we can't do it right
away.
The fact that inserted/updated tuples might wind up anywhere in the table,
not only at the end, creates no headaches except for heap_update. That
routine needs buffer locks on both the page containing the old tuple and
the page that will contain the new. To avoid possible deadlocks between
different backends locking the same two pages in opposite orders, we need
to constrain the lock ordering used by heap_update. This is doable but
will require slightly more code than is there now.
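The constrained lock ordering amounts to something like this (illustrative only; the real code locks shared buffers, not entries in a dict):

```python
# Sketch of deadlock-free two-page locking for heap_update: always
# acquire buffer locks in ascending page-number order, so no two
# backends can ever wait on each other in a cycle.

import threading

page_locks = {}                    # page number -> lock

def lock_pages(old_page, new_page):
    """Lock both pages in ascending order (once, if they coincide)."""
    order = sorted({old_page, new_page})
    for p in order:
        page_locks.setdefault(p, threading.Lock()).acquire()
    return order

def unlock_pages(order):
    for p in reversed(order):
        page_locks[p].release()

held = lock_pages(7, 3)            # every backend locks page 3 first
```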
Index access method improvements
--------------------------------
Presently, VACUUM deletes index tuples by doing a standard index scan
and checking each returned index tuple to see if it points at any of
the tuples to be deleted. If so, the index AM is called back to delete
the tested index tuple. This is horribly inefficient: it means one trip
into the index AM (with associated buffer lock/unlock and search overhead)
for each tuple in the index, plus another such trip for each tuple actually
deleted.
This is mainly a problem of a poorly chosen API. The index AMs should
offer a "bulk delete" call, which is passed a sorted array of main-table
TIDs. The loop over the index tuples should happen internally to the
index AM. At least in the case of btree, this could be done by a
sequential scan over the index pages, which avoids the random I/O of an
index-order scan and so should offer additional speedup.
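A rough model of the proposed bulk-delete entry point (the page layout, function name, and TIDs are invented; the key idea is one sequential pass over the index, with each entry tested against the sorted dead-TID array by binary search):

```python
# Sketch of a "bulk delete" index AM call; not PostgreSQL code.

from bisect import bisect_left

def index_bulk_delete(index_pages, dead_tids):
    """index_pages: list of pages, each a list of (key, heap_tid) pairs.
    dead_tids must be sorted.  Returns the number of entries removed."""
    removed = 0
    for i, page in enumerate(index_pages):
        kept = []
        for key, tid in page:
            pos = bisect_left(dead_tids, tid)
            if pos < len(dead_tids) and dead_tids[pos] == tid:
                removed += 1       # points at a dead tuple: drop it
            else:
                kept.append((key, tid))
        index_pages[i] = kept
    return removed

pages = [[("a", 1), ("b", 5)], [("c", 9), ("d", 12)]]
n = index_bulk_delete(pages, [5, 9])
```

This is one pass over the index total, versus one index traversal per index tuple plus one per deletion in the old API.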
Further out (possibly not for 7.2), we should also look at making the
index AMs responsible for shrinking indexes during deletion, or perhaps
via a separate "vacuum index" API. This can be done without exclusive
locks on the index --- the original Lehman & Yao concurrent-btrees paper
didn't describe how, but more recent papers show how to do it. As with
the main tables, I think it's sufficient to recycle freed space within
the index, and not necessarily try to give it back to the OS.
We will also want to look at upgrading the non-btree index types to allow
concurrent operations. This may be a research problem; I don't expect to
touch that issue for 7.2. (Hence, lazy VACUUM on tables with non-btree
indexes will still create lockouts until this is addressed. But note that
the lockout only lasts through step B of the VACUUM, not the whole thing.)
There you have it. If people like this, I'm prepared to commit to
making it happen for 7.2. Comments, objections, better ideas?
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
[...]
There you have it. If people like this, I'm prepared to commit to
making it happen for 7.2. Comments, objections, better ideas?
I'm just commenting from the peanut gallery, but it looks really
well-thought-out to me. I like the general "fast, good enough, not
necessarily perfect" approach you've taken.
I'd be happy to help test and debug once things get going.
-Doug
--
The rain man gave me two cures; he said jump right in,
The first was Texas medicine--the second was just railroad gin,
And like a fool I mixed them, and it strangled up my mind,
Now people just get uglier, and I got no sense of time... --Dylan
Very neat. You mention that the truncation of both heap and index
relations is not necessarily mandatory. Under what conditions would
either of them be truncated?
Mike Mascari
mascarm@mascari.com
-----Original Message-----
From: Tom Lane [SMTP:tgl@sss.pgh.pa.us]
....
3. Lazy VACUUM processes a table in five stages:
A. Scan relation looking for dead tuples; accumulate a list of their
TIDs, as well as info about existing free space. (This pass is
completely read-only and so incurs no WAL traffic.)
B. Remove index entries for the dead tuples. (See below for details.)
C. Physically delete dead tuples and compact free space on their pages.
D. Truncate any completely-empty pages at relation's end. (Optional,
see below.)
E. Create/update FSM entry for the table.
...
Further out (possibly not for 7.2), we should also look at making the
index AMs responsible for shrinking indexes during deletion, or perhaps
via a separate "vacuum index" API. This can be done without exclusive
locks on the index --- the original Lehman & Yao concurrent-btrees paper
didn't describe how, but more recent papers show how to do it. As with
the main tables, I think it's sufficient to recycle freed space within
the index, and not necessarily try to give it back to the OS.
...
regards, tom lane
Tom Lane wrote:
I love it all.
I agree that vacuum should be an optional function that really packs tables.
I also like the idea of a vacuum that runs in the background and does not
affect operation too badly.
My only suggestion would be to store some information in the statistics about
whether, and how badly, a table needs to be vacuumed. In a scheduled
background environment, the tables that need it most should get it most often.
Oftentimes, many tables never need to be vacuumed.
Also, it would be good to be able to update the statistics without doing a
vacuum, i.e. rather than having to vacuum to analyze, being able to analyze
without a vacuum.
Free space map details
----------------------
I envision the FSM as a shared hash table keyed by table ID, with each
entry containing a list of page numbers and free space in each such page.
The FSM is empty at system startup and is filled by lazy VACUUM as it
processes each table. Backends then decrement/remove page entries as they
use free space.
Critical point: the FSM is only a hint and does not have to be perfectly
accurate. It can omit space that's actually available without harm, and
if it claims there's more space available on a page than there actually
is, we haven't lost much except a wasted ReadBuffer cycle. This allows
us to take shortcuts in maintaining it. In particular, we can constrain
the FSM to a prespecified size, which is critical for keeping it in shared
memory. We just discard entries (pages or whole relations) as necessary
to keep it under budget. Obviously, we'd not bother to make entries in
the first place for pages with only a little free space. Relation entries
might be discarded on a least-recently-used basis.
The only question I have is about the Free Space Map. It would seem
better to me if we could get this map closer to the table itself, rather
than having every table of every database mixed into the same shared
memory area. I can just see random table access clearing out most of
the map cache and perhaps making it less useful.
It would be nice if we could store the map on the first page of the disk
table, or store it in a flat file per table. I know both of these ideas
will not work, but I am just throwing it out to see if someone has a
better idea.
I wonder if cache failures should be what drives the vacuum daemon to
vacuum a table? Sort of like, "Hey, someone is asking for free pages
for that table. Let's go find some!" That may work really well.
Another advantage of centralization is that we can record update/delete
counters per table, helping tell vacuum where to vacuum next. Vacuum
roaming around looking for old tuples seems wasteful.
Also, I suppose if we have the map act as a shared table cache (fseek
info), that may outweigh the disadvantage of having it all centralized.
I know I am throwing out the advantages and disadvantages of
centralization, but I thought I would give out the ideas.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Heck ya...
I wonder if cache failures should be what drives the vacuum daemon to
vacuum a table? Sort of like, "Hey, someone is asking for free pages
for that table. Let's go find some!" That may work really well.
Another advantage of centralization is that we can record update/delete
counters per table, helping tell vacuum where to vacuum next. Vacuum
roaming around looking for old tuples seems wasteful.
Counters seem like a nice addition. For example, access patterns to session
tables are almost pure UPDATE/DELETEs and a ton of them. On the other hand,
log type tables see no UPDATE/DELETE but tend to be huge in comparison. I
suspect many applications outside ours will show large disparities in the
"vacuumability" score for different tables.
Quick question:
Using lazy vacuum, if I have two identical (at the file level) copies of a
database, run the same queries against them for a few days, then shut them
down again, are the copies still identical? Is this different than the
current behavior (ie, queries, full vacuum)?
AZ
Tom Lane wrote:
mlw <markw@mohawksoft.com> writes:
Also, it would be good to be able to update the statistics without doing a
vacuum, i.e. rather than having to vacuum to analyze, being able to analyze
without a vacuum.
Irrelevant, not to mention already done ...
Do you mean that we already can do just analyze ?
What is the syntax ?
----------------
Hannu
Tom Lane wrote:
Enough handwaving, what about specifics?
1. Forget moving tuples from one page to another. Doing that in a
transaction-safe way is hugely expensive and complicated. Lazy VACUUM
will only delete dead tuples and coalesce the free space thus made
available within each page of a relation.
If we really need to move a tuple we can do it by a regular update that
SET-s nothing and just copies the tuple to another page inside a separate
transaction.
-------------------
Hannu
Also, it would be good to be able to update the statistics without doing a
vacuum, i.e. rather than having to vacuum to analyze, being able to analyze
without a vacuum.
I was going to ask the same thing. In a lot of situations (insert
dominated) vacuum analyze is more important for the update of statistics
than for recovery of space. Could we roll that into this? Or should we
also have an Analyze daemon?
Free space map details
----------------------
Accesses to the FSM could create contention problems if we're not careful.
Another quick thought for handling FSM contention problems. A backend could
give up waiting for access to the FSM after a short period of time, and just
append its data to the end of the file the same way it's done now. Dunno
if that is feasible, but it seemed like an idea to me.
Other than that, I would just like to say this will be a great improvement
for pgsql. Tom, you and several other on this list continue to impress the
hell out of me.
mlw <markw@mohawksoft.com> writes:
My only suggestion would be to store some information in the statistics about
whether or not, and how bad, a table needs to be vacuumed.
I was toying with the notion of using the FSM to derive that info,
somewhat indirectly to be sure (since what the FSM could tell you would
be about tuples inserted not tuples deleted). Heavily used FSM entries
would be the vacuum daemon's cues for tables to hit more often.
ANALYZE stats don't seem like a productive way to attack this, since
there's no guarantee they'd be updated often enough. If the overall
data distribution of a table isn't changing much, there's no need to
analyze it often.
Also, it would be good to be able to update the statistics without doing a
vacuum, i.e. rather than having to vacuum to analyze, being able to analyze
without a vacuum.
Irrelevant, not to mention already done ...
regards, tom lane
On Thu, 17 May 2001, Tom Lane wrote:
I have been thinking about the problem of VACUUM and how we might fix it
for 7.2. Vadim has suggested that we should attack this by implementing
an overwriting storage manager and transaction UNDO, but I'm not totally
comfortable with that approach: it seems to me that it's an awfully large
change in the way Postgres works. Instead, here is a sketch of an attack
that I think fits better into the existing system structure.
<snip>
My AUD0.02, FWIW, is this sounds great. You said you were only planning to
concentrate on performance enhancements, not new features, Tom, but IMHO
this is a new feature and a good one :).
As several others have mentioned, automatic analyze would also be nice. I
gather the backend already has the ability to treat analyze as a separate
process, so presumably this is a completely separate issue from automatic
vacuum. Some sort of background daemon or whatever would be good. And
again, one could take the approach that it doesn't have to get it 100%
right, at least in the short term; as long as it is continually
moving in the direction of accurate statistics, then that's
much better than the current situation. Presumably one could also retain
the option of doing an explicit analyze occasionally, if you have
processor cycles to burn and are really keen to get the stats correct in
a hurry.
Tim
--
-----------------------------------------------
Tim Allen tim@proximity.com.au
Proximity Pty Ltd http://www.proximity.com.au/
http://www4.tpg.com.au/users/rita_tim/
Mike Mascari <mascarm@mascari.com> writes:
Very neat. You mention that the truncation of both heap and index
relations is not necessarily mandatory. Under what conditions would
either of them be truncated?
In the proposal as given, a heap file would be truncated if (a) it
has at least one totally empty block at the end, and (b) no other
transaction is touching the table at the instant that VACUUM is
ready to truncate it.
This would probably be fairly infrequently true, especially for
heavily used tables, but if you believe in a "steady state" analysis
then that's just fine. No point in handing blocks back to the OS
only to have to allocate them again soon.
We might want to try to tilt the FSM-driven reuse of freed space
to favor space near the start of the file and avoid end blocks.
Without that, you might never get totally free blocks at the end.
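The suggested tilt toward the start of the file amounts to something like this (the entries and numbers are invented; among pages with enough room, take the lowest page number so that end pages tend to drain and become truncatable):

```python
# Sketch of start-of-file-biased FSM page selection; not PostgreSQL code.

def pick_page(fsm_entries, needed):
    """fsm_entries: {page number: free bytes}.  Prefer low page numbers."""
    candidates = [p for p, free in fsm_entries.items() if free >= needed]
    return min(candidates) if candidates else None

choice = pick_page({10: 500, 2: 500, 7: 50}, 100)   # pages 2 and 10 qualify
```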
The same comments hold for index blocks, with the additional problem
that the index structure would make it almost impossible to drive usage
away from the physical end-of-file. For btrees I think it'd be
sufficient if we could recycle empty blocks for use elsewhere in the
btree structure. Actually shrinking the index probably won't happen
short of a REINDEX.
regards, tom lane
Tom Lane wrote:
I have been thinking about the problem of VACUUM and how we might fix it
for 7.2. Vadim has suggested that we should attack this by implementing
an overwriting storage manager and transaction UNDO, but I'm not totally
comfortable with that approach:
IIRC, Vadim doesn't intend to implement an overwriting
smgr, at least in the near future, though he seems to
have a plan to implement functionality to allow space
re-use without vacuum, as marked in TODO. IMHO, UNDO
functionality under a no-overwrite smgr (ISTM much easier
than under an overwriting one) has the highest priority.
Savepoints were planned for 7.0 but we don't have them yet.
[snip]
Locking issues
--------------
We will need two extensions to the lock manager:
1. A new lock type that allows concurrent reads and writes
(AccessShareLock, RowShareLock, RowExclusiveLock) but not anything else.
What's different from RowExclusiveLock ?
Does it conflict with itself ?
The conditional lock will be used by lazy VACUUM to try to upgrade its
VacuumLock to an AccessExclusiveLock at step D (truncate table). If it's
able to get exclusive lock, it's safe to truncate any unused end pages.
Without exclusive lock, it's not, since there might be concurrent
transactions scanning or inserting into the empty pages. We do not want
lazy VACUUM to block waiting to do this, since if it does that it will
create a lockout situation (reader/writer transactions will stack up
behind it in the lock queue while everyone waits for the existing
reader/writer transactions to finish). Better to not do the truncation.
And would the truncation occur that often in reality under
the scheme(without tuple movement) ?
[snip]
Index access method improvements
--------------------------------
Presently, VACUUM deletes index tuples by doing a standard index scan
and checking each returned index tuple to see if it points at any of
the tuples to be deleted. If so, the index AM is called back to delete
the tested index tuple. This is horribly inefficient: it means one trip
into the index AM (with associated buffer lock/unlock and search overhead)
for each tuple in the index, plus another such trip for each tuple actually
deleted.
This is mainly a problem of a poorly chosen API. The index AMs should
offer a "bulk delete" call, which is passed a sorted array of main-table
TIDs. The loop over the index tuples should happen internally to the
index AM. At least in the case of btree, this could be done by a
sequential scan over the index pages, which avoids the random I/O of an
index-order scan and so should offer additional speedup.
???? Isn't the current implementation a "bulk delete"?
Fast access to individual index tuples also seems to be
needed in the case of only a few dead tuples.
Further out (possibly not for 7.2), we should also look at making the
index AMs responsible for shrinking indexes during deletion, or perhaps
via a separate "vacuum index" API. This can be done without exclusive
locks on the index --- the original Lehman & Yao concurrent-btrees paper
didn't describe how, but more recent papers show how to do it. As with
the main tables, I think it's sufficient to recycle freed space within
the index, and not necessarily try to give it back to the OS.
Great. There would be few disadvantages in our btree
implementation.
regards,
Hiroshi Inoue
"Matthew T. O'Connor" <matthew@zeut.net> writes:
Another quick thought for handling FSM contention problems. A backend could
give up waiting for access to the FSM after a short period of time, and just
append its data to the end of the file the same way it's done now. Dunno
if that is feasible, but it seemed like an idea to me.
Mmm ... maybe, but I doubt it'd help much. Appending a page to the file
requires grabbing some kind of lock anyway (since you can't have two
backends doing it at the same instant). With any luck, that locking can
be merged with the locking involved in accessing the FSM.
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
The only question I have is about the Free Space Map. It would seem
better to me if we could get this map closer to the table itself, rather
than having every table of every database mixed into the same shared
memory area. I can just see random table access clearing out most of
the map cache and perhaps making it less useful.
What random access? Read transactions will never touch the FSM at all.
As for writes, seems to me the places you are writing are exactly the
places you need info for.
You make a good point, which is that we don't want a schedule-driven
VACUUM to load FSM entries for unused tables into the map at the cost
of throwing out entries that *are* being used. But it seems to me that
that's easily dealt with if we recognize the risk.
It would be nice if we could store the map on the first page of the disk
table, or store it in a flat file per table. I know both of these ideas
will not work,
You said it. What's wrong with shared memory? You can't get any closer
than shared memory: keeping maps in the files would mean you'd need to
chew up shared-buffer space to get at them. (And what was that about
random accesses causing your maps to get dropped? That would happen
for sure if they live in shared buffers.)
Another problem with keeping stuff in the first page: what happens when
the table gets big enough that 8k of map data isn't really enough?
With a shared-memory area, we can fairly easily allocate a variable
amount of space based on total size of a relation vs. total size of
relations under management.
It is true that a shared-memory map would be useless at system startup,
until VACUUM has run and filled in some info. But I don't see that as
a big drawback. People who aren't developers like us don't restart
their postmasters every five minutes.
Another advantage of centralization is that we can record update/delete
counters per table, helping tell vacuum where to vacuum next. Vacuum
roaming around looking for old tuples seems wasteful.
Indeed. But I thought you were arguing against centralization?
regards, tom lane
At 19:05 17/05/01 -0400, Tom Lane wrote:
But having said that, there's no reason to remove the existing VACUUM
code: we can keep it around for situations where you need to crunch a
table as much as possible and you can afford to lock the table while
you do it.
It would be great if this was the *only* reason to use the old-style
VACUUM. ie. We should try to avoid a solution that has a VACUUM LAZY in
background and a recommendation to 'VACUUM PROPERLY' once in a while.
The new code would be a new command, maybe "VACUUM LAZY"
(or some other name entirely).
Maybe a name that reflects its strength/purpose: 'VACUUM
ONLINE/BACKGROUND/NOLOCKS/CONCURRENT' etc.
Enough handwaving, what about specifics?
1. Forget moving tuples from one page to another. Doing that in a
transaction-safe way is hugely expensive and complicated. Lazy VACUUM
will only delete dead tuples and coalesce the free space thus made
available within each page of a relation.
Could this be done opportunistically, meaning it builds up a list of
candidates to move (perhaps based on emptiness of page), then moves a
subset of these in each pass? It's only really useful in the case of a
table that has a high update load then becomes static. Which is not as
unusual as it sounds: people do archive tables by renaming them, then
create a new lean 'current' table. With the new vacuum, the static table
may end up with many half-empty pages that are never reused.
2. This does no good unless there's a provision to re-use that free space.
To do that, I propose a free space map (FSM) kept in shared memory, which
will tell backends which pages of a relation have free space. Only if the
FSM shows no free space available will the relation be extended to insert
a new or updated tuple.
I assume that now is not a good time to bring up memory-mapped files? ;-}
3. Lazy VACUUM processes a table in five stages:
A. Scan relation looking for dead tuples; accumulate a list of their
TIDs, as well as info about existing free space. (This pass is
completely read-only and so incurs no WAL traffic.)
Were you planning on just a free byte count, or something smaller? Dec/RDB
uses a nasty system of DBA-defined thresholds for each storage area: 4 bits,
where 0=empty, and 1, 2 & 3 indicate above/below thresholds (3 is also
considered 'full'). The thresholds are usually set based on average record
sizes. In this day & age, I suspect a 1-byte percentage or 2-byte count is
OK unless space is really at a premium.
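To put rough numbers on the 1-byte option being weighed here (the 8k page size and linear 0..255 scaling are assumptions for illustration), this sketch bounds the worst misestimate a scaled byte can make versus an exact 2-byte count:

```python
# Sketch of the 1-byte-vs-2-byte free-space encoding trade-off.

PAGE_SIZE = 8192

def encode(free_bytes):            # 1 byte: scaled to 0..255
    return free_bytes * 255 // PAGE_SIZE

def decode(code):                  # back to an (approximate) byte count
    return code * PAGE_SIZE // 255

# largest error the 1-byte encoding introduces, over all possible values
worst = max(abs(f - decode(encode(f))) for f in range(PAGE_SIZE + 1))
# worst is on the order of PAGE_SIZE / 255, i.e. roughly 32 bytes
```

A ~32-byte uncertainty per 8k page is well below any plausible tuple size, which supports the view that an exact count buys little over one byte.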
5. Also note that there's nothing saying that lazy VACUUM must do the
entire table in one go; once it's accumulated a big enough batch of dead
tuples, it can proceed through steps B,C,D,E even though it hasn't scanned
the whole table. This avoids a rather nasty problem that VACUUM has
always had with running out of memory on huge tables.
This sounds great, especially if the same approach could be adopted when/if
moving records.
Critical point: the FSM is only a hint and does not have to be perfectly
accurate. It can omit space that's actually available without harm, and
if it claims there's more space available on a page than there actually
is, we haven't lost much except a wasted ReadBuffer cycle.
So long as you store the # of bytes (or %), that should be fine. One of the
horrors of the Dec/RDB system is that with badly set thresholds you can
cycle through many pages looking for one that *really* has enough free space.
Also, would the detecting process fix the bad entry?
This allows
us to take shortcuts in maintaining it. In particular, we can constrain
the FSM to a prespecified size, which is critical for keeping it in shared
memory. We just discard entries (pages or whole relations) as necessary
to keep it under budget.
Presumably keeping the 'most empty' pages?
Obviously, we'd not bother to make entries in
the first place for pages with only a little free space. Relation entries
might be discarded on a least-recently-used basis.
You also might want to record some 'average/min/max' record size for the
table to assess when a page's free space is insufficient for the
average/minimum record size.
While on the subject of record keeping, it would be great if it was coded
to collect statistics about its own operation for Jan's stats package.
Accesses to the FSM could create contention problems if we're not careful.
I think this can be dealt with by having each backend remember (in its
relcache entry for a table) the page number of the last page it chose from
the FSM to insert into. That backend will keep inserting new tuples into
that same page, without touching the FSM, as long as there's room there.
Only when the page fills up does it go back to the FSM, update or remove
that page entry,
and choose another page to start inserting on. This reduces the access
load on the FSM from once per tuple to once per page.
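The once-per-page scheme could look something like this sketch. The names
(RelCacheEntry, fsm_get_page, get_insert_page) are stand-ins for
illustration, not the actual relcache or FSM interfaces; the call counter
just makes the reduced FSM traffic visible:

```c
#include <stdint.h>

/* Each backend remembers (in its relcache entry) the page it last got
 * from the FSM and keeps inserting there until it fills, touching the
 * shared FSM only once per page rather than once per tuple. */
typedef struct {
    int32_t cached_page;   /* last page obtained from the FSM, -1 if none */
} RelCacheEntry;

/* Stand-in for a shared-FSM lookup; counts calls to show the saving. */
static int fsm_calls = 0;
static int32_t fsm_get_page(void) { fsm_calls++; return 42; }

/* Returns the page to insert into; 'page_full' reports that the cached
 * page ran out of room for this tuple. */
int32_t get_insert_page(RelCacheEntry *rel, int page_full)
{
    if (rel->cached_page < 0 || page_full)
        rel->cached_page = fsm_get_page();   /* only now touch the FSM */
    return rel->cached_page;
}
```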
This seems to have the potential to create as many false FSM page entries
as there are backends. Is it really that expensive to lock the FSM table
entry, subtract a number, then unlock it, especially when compared to the
cost of adding/updating a whole record?
(Moreover, we can
arrange that successive backends consulting the FSM pick different pages
if possible. Then, concurrent inserts will tend to go to different pages,
reducing contention for shared buffers; yet any single backend does
sequential inserts in one page, so that a bulk load doesn't cause
disk traffic scattered all over the table.)
This will also increase the natural clustering of the database for SERIAL
fields even though the overall ordering will be all over the place (at
least for insert-intensive apps).
Locking issues
--------------
We will need two extensions to the lock manager:
1. A new lock type that allows concurrent reads and writes
(AccessShareLock, RowShareLock, RowExclusiveLock) but not anything else.
Lazy VACUUM will grab this type of table lock to ensure the table schema
doesn't change under it. Call it a VacuumLock until we think of a better
name.
Would it be possible/worthwhile to add a 'blocking notification' to the lock
manager? Then VACUUM could choose to terminate/restart when someone wants to
do a schema change. This is really only relevant if the VACUUM is prolonged.
----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.B.N. 75 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
Tom Lane wrote:
1. A new lock type that allows concurrent reads and writes
(AccessShareLock, RowShareLock, RowExclusiveLock) but not
anything else.
What's different from RowExclusiveLock?
I wanted something that *does* conflict with itself, thereby ensuring
that only one instance of VACUUM can be running on a table at a time.
This is not absolutely necessary, perhaps, but without it matters would
be more complex for little gain that I can see. (For example, the tuple
TIDs that lazy VACUUM gathers in step A might already be deleted and
compacted out of existence by another VACUUM, and then reused as new
tuples, before the first VACUUM gets back to the page to delete them.
There would need to be a defense against that if we allow concurrent
VACUUMs.)
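The conflict behavior Tom describes (VacuumLock admits the read/write lock
levels but conflicts with itself and with anything stronger) can be written
down as a small conflict matrix. The lock-mode names follow PostgreSQL, but
the bitmask layout here is an assumption made for the sketch, not the real
lock manager's tables:

```c
/* Illustrative conflict matrix for the proposed VacuumLock. */
enum {
    AccessShareLock = 0,
    RowShareLock,
    RowExclusiveLock,
    VacuumLock,             /* the proposed new level */
    ShareLock,
    AccessExclusiveLock,
    NUM_LOCKMODES
};

/* conflicts[m] is a bitmask of the modes that mode m conflicts with;
 * the matrix is symmetric, as any lock conflict table must be. */
static const unsigned conflicts[NUM_LOCKMODES] = {
    [AccessShareLock]     = 1u << AccessExclusiveLock,
    [RowShareLock]        = 1u << AccessExclusiveLock,
    [RowExclusiveLock]    = (1u << ShareLock) | (1u << AccessExclusiveLock),
    [VacuumLock]          = (1u << VacuumLock) | (1u << ShareLock)
                          | (1u << AccessExclusiveLock),
    [ShareLock]           = (1u << RowExclusiveLock) | (1u << VacuumLock)
                          | (1u << ShareLock) | (1u << AccessExclusiveLock),
    [AccessExclusiveLock] = ~0u,        /* conflicts with everything */
};

int lock_conflicts(int a, int b)
{
    return (conflicts[a] >> b) & 1;
}
```

The self-conflict bit is what guarantees only one VACUUM per table at a
time, while readers and writers pass straight through.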
And would the truncation occur that often in reality under
the scheme (without tuple movement)?
Probably not, per my comments to someone else. I'm not very concerned
about that, as long as we are able to recycle freed space within the
relation.
We could in fact move tuples if we wanted to --- it's not fundamentally
different from an UPDATE --- but then VACUUM becomes a transaction and
we have the WAL-log-traffic problem back again. I'd like to try it
without that and see if it gets the job done.
Isn't the current implementation "bulk delete"?
No, the index AM is called separately for each index tuple to be
deleted; more to the point, the search for deletable index tuples
should be moved inside the index AM for performance reasons.
Fast access to individual index tuples also seems to be
needed in the case of only a few dead tuples.
Yes, that is the approach Vadim used in the LAZY VACUUM code he did
last summer for Alfred Perlstein. I had a couple of reasons for not
duplicating that method:
1. Searching for individual index tuples requires hanging onto (or
refetching) the index key data for each target tuple. A bulk delete
based only on tuple TIDs is simpler and requires little memory.
2. The main reason for that form of lazy VACUUM is to minimize the time
spent holding the exclusive lock on a table. With this approach we
don't need to worry so much about that. A background task trawling
through an index won't really bother anyone, since it won't lock out
updates.
3. If we're concerned about the speed of lazy VACUUM when there are only a
few tuples to be recycled, there's a really easy answer: don't recycle 'em.
Leave 'em be until there are enough to make it worth going after 'em.
Remember this is a "fast and good enough" approach, not a "get every
possible byte every time" approach.
Using a key-driven delete when we have only a few deletable tuples might
be a useful improvement later on, but I don't think it's necessary to
bother with it in the first go-round. This is a big project already...
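The TID-only bulk delete of point 1 can be sketched as: sort the dead heap
TIDs once, then make a single sequential pass over the index, dropping any
entry whose heap TID appears in the sorted list. The types and function
names here are simplified stand-ins, not the index AM API:

```c
#include <stdlib.h>
#include <stdint.h>

/* A heap tuple identifier: block number plus line-pointer offset. */
typedef struct { uint32_t block; uint16_t offset; } Tid;

static int tid_cmp(const void *a, const void *b)
{
    const Tid *x = a, *y = b;
    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;
    return (int) x->offset - (int) y->offset;
}

/* Remove from index[] (n entries) every TID present in dead[] (ndead
 * entries; sorted in place).  Returns the new entry count.  Memory use
 * is just the TID list itself: no index keys need to be retained. */
int index_bulk_delete(Tid *index, int n, Tid *dead, int ndead)
{
    int i, kept = 0;

    qsort(dead, ndead, sizeof(Tid), tid_cmp);
    for (i = 0; i < n; i++)             /* one sequential index pass */
        if (!bsearch(&index[i], dead, ndead, sizeof(Tid), tid_cmp))
            index[kept++] = index[i];   /* keep only live entries */
    return kept;
}
```

Each index entry costs one binary search over the dead-TID list, and the
list itself is six bytes per dead tuple, which is the "little memory"
property claimed above.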
regards, tom lane
Philip Warner <pjw@rhyme.com.au> writes:
At 19:05 17/05/01 -0400, Tom Lane wrote:
1. Forget moving tuples from one page to another.
Could this be done opportunistically, meaning it builds up a list of
candidates to move (perhaps based on emptiness of page), then moves a
subset of these in each pass?
Well, if we move tuples at all then we have a considerably different
animal: to move tuples across pages you must be a transaction so that
you can have an atomic commit for both pages, and that brings back the
issue of how long the transaction runs for and how large its WAL trail
will grow before it can be dropped. Yeah, you could move a limited
number of tuples, commit, and start again ... but it's not so
lightweight anymore.
Perhaps we will eventually end up with three strengths of VACUUM:
the existing heavy-duty form, the "lazy" form that isn't transactional,
and an intermediate form that is willing to move tuples in simpler
cases (none of that tuple-chain-moving stuff please ;-)). But I'm
not buying into producing the intermediate form in this go-round.
Let's build the lazy form first and get some experience with it
before we decide if we need yet another kind of VACUUM.
To do that, I propose a free space map (FSM) kept in shared memory, which
will tell backends which pages of a relation have free space. Only if the
FSM shows no free space available will the relation be extended to insert
a new or updated tuple.
I assume that now is not a good time to bring up memory-mapped files? ;-}
Don't see the relevance exactly ...
Were you planning on just a free byte count, or something smaller? Dec/RDB
uses a nasty system of DBA-defined thresholds for each storage area: 4 bits,
where 0 = empty, and 1, 2 & 3 indicate above/below thresholds (3 is also
considered 'full'). The thresholds are usually set based on average record
sizes. In this day and age, I suspect a 1-byte percentage or a 2-byte count
is OK unless space is really at a premium.
I had toyed with two different representations of the FSM:
1. Bitmap: one bit per page in the relation, set if there's an
"interesting" amount of free space in the page (exact threshold ???).
DEC's approach seems to be a generalization of this.
2. Page list: page number and number of free bytes. This is six bytes
per page represented; you could maybe compress it to 5 but I'm not sure
there's much point.
I went with #2 mainly because it adapts easily to being forced into a
limited amount of space (just forget the pages with least free space)
which is critical for a table to be kept in shared memory. A bitmap
would be less forgiving. #2 also gives you a better chance of going
to a page that actually has enough free space for your tuple, though
you'd still need to be prepared to find out that it no longer has
enough once you get to it. (Whereupon, you go back to the FSM, fix
the outdated entry, and keep searching.)
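That last behavior (treat the entry as a hint, fix it when it proves stale,
and keep searching) can be sketched like this. fsm_hint, page_free_space and
the sizes are illustrative stand-ins, not actual PostgreSQL interfaces:

```c
#include <stdint.h>

/* The FSM's free-byte counts may be stale; the truth lives on the page. */
#define NPAGES 3
static uint16_t fsm_hint[NPAGES]    = { 400, 120, 300 }; /* possibly stale */
static uint16_t actual_free[NPAGES] = {  40, 120, 300 }; /* page 0 shrank  */

static uint16_t page_free_space(int page) { return actual_free[page]; }

/* Return a page with at least 'needed' bytes free, or -1; fix any
 * outdated FSM entries encountered along the way, then keep searching. */
int find_insert_page(uint16_t needed)
{
    int page;

    for (page = 0; page < NPAGES; page++) {
        if (fsm_hint[page] < needed)
            continue;                        /* FSM says too small: skip */
        if (page_free_space(page) >= needed)
            return page;                     /* the hint was good enough */
        fsm_hint[page] = page_free_space(page); /* correct the stale hint */
    }
    return -1;                               /* nothing found: extend */
}
```

The worst case for a stale entry is exactly the wasted ReadBuffer cycle
described above, after which the entry is accurate again.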
While on the subject of record keeping, it would be great if it was coded
to collect statistics about its own operation for Jan's stats package.
Seems like a good idea, but I've seen no details yet about that package...
This seems to have the potential to create as many false FSM page entries
as there are backends. Is it really that expensive to lock the FSM table
entry, subtract a number, then unlock it?
Yes, if you are contending against N other backends to get that lock.
Remember the *whole* point of this design is to avoid locking as much
as possible. Too many trips to the FSM could throw away the performance
advantage.
Would it be possible/worthwhile to add a 'blocking notification' to the lock
manager? Then VACUUM could choose to terminate/restart when someone wants to
do a schema change. This is really only relevant if the VACUUM is prolonged.
Seems messier than it's worth ... the VACUUM might not be the only thing
holding off your schema update anyway, and regular transactions aren't
likely to pay any attention.
regards, tom lane
Tom Lane wrote:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
Tom Lane wrote:
And would the truncation occur that often in reality under
the scheme (without tuple movement)?
Probably not, per my comments to someone else. I'm not very concerned
about that, as long as we are able to recycle freed space within the
relation.
Agreed.
We could in fact move tuples if we wanted to --- it's not fundamentally
different from an UPDATE --- but then VACUUM becomes a transaction and
we have the WAL-log-traffic problem back again.
And it has always been the cause of bugs and inefficiency
in VACUUM, IMHO.
regards,
Hiroshi Inoue