Visibility map, partial vacuums
Here's finally my attempt at the visibility map, a.k.a. the dead space
map. It's still work-in-progress, but it's time to discuss some of the
design details. Patch attached, anyway, for reference.
The visibility map is basically a bitmap, with one bit per heap page:
'1' for pages that are known to contain only tuples that are visible to
everyone. Such pages don't need vacuuming, because there are no dead
tuples, and the information can also be used to skip visibility checks.
It should allow index-only scans in the future, 8.5 perhaps, but that's
not part of this patch. The visibility map is stored in a new relation
fork, alongside the main data and the FSM.
Lazy VACUUM only needs to visit pages that are '0' in the visibility
map. This allows partial vacuums, where we only need to scan those parts
of the table that need vacuuming, plus all indexes.
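To make the mapping concrete, here's a toy sketch of a one-bit-per-page
bitmap with set/clear/test operations. This is just an illustration, not
code from the patch; the real map lives in a relation fork and is
accessed through the buffer manager:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAP_BYTES 1024                 /* covers 8192 heap pages */

typedef struct VisibilityMap {
    uint8_t bits[MAP_BYTES];
} VisibilityMap;

static void vm_init(VisibilityMap *vm)
{
    memset(vm->bits, 0, sizeof(vm->bits));
}

/* Set the bit for heap block 'blkno': all tuples known visible to everyone. */
static void vm_set(VisibilityMap *vm, uint32_t blkno)
{
    vm->bits[blkno / 8] |= (uint8_t) (1 << (blkno % 8));
}

/* Clear the bit: the page may now contain dead or in-doubt tuples. */
static void vm_clear(VisibilityMap *vm, uint32_t blkno)
{
    vm->bits[blkno / 8] &= (uint8_t) ~(1 << (blkno % 8));
}

/* True if lazy VACUUM may skip this heap block. */
static bool vm_test(const VisibilityMap *vm, uint32_t blkno)
{
    return (vm->bits[blkno / 8] >> (blkno % 8)) & 1;
}
```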
To avoid having to update the visibility map every time a heap page is
updated, I have added a new flag to the heap page header,
PD_ALL_VISIBLE, which indicates the same thing as a set bit in the
visibility map: all tuples on the page are known to be visible to
everyone. When a page is modified, the visibility map only needs to be
updated if PD_ALL_VISIBLE was set. That should make the impact
unnoticeable for use cases with lots of updates, where the visibility
map doesn't help, as only the first update on a page after a vacuum
needs to update the visibility map.
As a bonus, I'm using the PD_ALL_VISIBLE flag to skip visibility checks
in sequential scans. That seems to give a small 5-10% speedup on my
laptop, on a simple "SELECT COUNT(*) FROM foo" query, where foo is a
narrow table with just a single integer column, fitting in RAM.
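The idea, in a hedged sketch (hypothetical types, not the actual
heapgetpage() code): when the page-level flag is set, the per-tuple
visibility test can be bypassed entirely:

```c
#include <stdbool.h>

typedef struct Tuple { bool visible; } Tuple;

typedef struct HeapPage {
    bool  pd_all_visible;      /* stand-in for PD_ALL_VISIBLE */
    Tuple tuples[64];
    int   ntuples;
} HeapPage;

/* Stand-in for a per-tuple snapshot check; deliberately trivial here. */
static bool tuple_satisfies_snapshot(const Tuple *t)
{
    return t->visible;
}

/* Count the tuples a scan would return, skipping the per-tuple
 * check when the whole page is known visible to everyone. */
static int scan_page(const HeapPage *page, int *checks_done)
{
    int found = 0;
    if (page->pd_all_visible)
        return page->ntuples;          /* fast path: no checks at all */
    for (int i = 0; i < page->ntuples; i++)
    {
        (*checks_done)++;
        if (tuple_satisfies_snapshot(&page->tuples[i]))
            found++;
    }
    return found;
}
```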
The critical part of this patch is keeping the PD_ALL_VISIBLE flag and
the visibility map up-to-date while avoiding race conditions. An
invariant is maintained: if the PD_ALL_VISIBLE flag is *not* set, the
corresponding bit in the visibility map must also not be set. If the
PD_ALL_VISIBLE flag is set, the bit in the visibility map can be set,
or not.
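Stated as a predicate (illustration only), the map bit being set must
imply the page flag is set:

```c
#include <stdbool.h>

/* Legal combinations of the page flag and the map bit:
 *   flag=0, bit=0  ok        flag=1, bit=0  ok
 *   flag=0, bit=1  BROKEN    flag=1, bit=1  ok
 */
static bool vm_invariant_holds(bool pd_all_visible, bool vm_bit)
{
    return !vm_bit || pd_all_visible;  /* vm_bit implies pd_all_visible */
}
```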
To modify a page:
If PD_ALL_VISIBLE flag is set, the bit in the visibility map is cleared
first. The heap page is kept pinned, but not locked, while the
visibility map is updated. We want to avoid holding a lock across I/O,
even though the visibility map is likely to stay in cache. After the
visibility map has been updated, the page is exclusively locked and
modified as usual, and PD_ALL_VISIBLE flag is cleared before releasing
the lock.
To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the
page, while you observe that all tuples on the page are visible to everyone.
To set the bit in the visibility map, you need to hold a cleanup lock on
the heap page. That keeps away other backends trying to clear the bit in
the visibility map at the same time. Note that you need to hold a lock
on the heap page to examine PD_ALL_VISIBLE, otherwise the cleanup lock
doesn't protect from the race condition.
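The ordering for modifying a page can be sketched as a toy simulation
with plain booleans (ignoring buffer pins, WAL, and real locking): clear
the map bit before taking the exclusive lock, and clear the page flag
before releasing it, so the invariant above is never violated:

```c
#include <stdbool.h>

typedef struct {
    bool vm_bit;            /* bit in the visibility map */
    bool pd_all_visible;    /* flag in the heap page header */
    bool locked;            /* exclusive content lock held? */
} PageState;

/* Modify a heap page while preserving the invariant
 * "vm_bit set implies pd_all_visible set". */
static void modify_page(PageState *p)
{
    if (p->pd_all_visible)
        p->vm_bit = false;  /* clear map bit first: page pinned, not locked */

    p->locked = true;       /* now take the exclusive lock */
    /* ... perform the insert/update/delete here ... */
    p->pd_all_visible = false;
    p->locked = false;      /* flag cleared before the lock is released */
}
```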
That's how the patch works right now. However, there's a small
performance problem with the current approach: setting the
PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen:
1. All tuples on a page become visible to everyone. The inserting
transaction committed, for example. A backend sees that and sets
PD_ALL_VISIBLE.
2. Vacuum comes along, and sees that there's no work to be done on the
page. It sets the bit in the visibility map.
3. The visibility map page is flushed to disk. The heap page is not, yet.
4. Crash
The bit in the visibility map is now set, but the corresponding
PD_ALL_VISIBLE flag is not, because it never made it to disk.
I'm avoiding that at the moment by only setting PD_ALL_VISIBLE as part
of a page prune operation, and forcing a WAL record to be written even
if no other work is done on the page. The downside of that is that it
can lead to a big increase in WAL traffic after a bulk load, for
example: the first vacuum after the bulk load would have to write a WAL
record for every heap page, even though there are no dead tuples.
One option would be to just ignore that problem for now, and not
WAL-log. As long as we don't use the visibility map for anything like
index-only scans, it doesn't matter much if there are some bits set that
shouldn't be. It just means that VACUUM will skip some pages that need
vacuuming, but VACUUM FREEZE will eventually catch those. Given how
little time we have until commitfest and feature freeze, that's probably
the most reasonable thing to do. I'll follow up with other solutions to
that problem, but mainly for discussion for 8.5.
Another thing that does need to be fixed is the way that extension
and truncation of the visibility map are handled; that's broken in the
current patch. I started working on the patch a long time ago, before
the FSM rewrite was finished, and haven't gotten around to fixing that
part yet. We already solved it for the FSM, so we could just follow that
pattern. The way we solved truncation in the FSM was to write a separate
WAL record with the new heap size, but perhaps we want to revisit that
decision instead of again adding new code to write a third WAL record
for truncation of the visibility map. smgrtruncate() writes a WAL record
of its own if any full blocks are truncated away from the FSM, but we
needed a WAL record even if no full blocks are truncated from the FSM
file, because the "tail" of the last remaining FSM page, representing
the truncated-away heap pages, still needs to be cleared. The visibility
map has the same problem.
One proposal was to piggyback on the smgrtruncate() WAL-record, and call
FreeSpaceMapTruncateRel from smgr_redo(). I considered that ugly from a
modularity point of view; smgr.c shouldn't be calling higher-level
functions. But maybe it wouldn't be that bad, after all. Or, we could
remove WAL-logging from smgrtruncate() altogether, and move it to
RelationTruncate() or another higher-level function, and handle the
WAL-logging and replay there.
There are some side-effects of partial vacuums that also need to be
fixed. First of all, the tuple count stored in pg_class is now wrong: it
only includes tuples from the pages that were visited. VACUUM VERBOSE
output needs to be changed as well, to reflect that only some pages were
scanned.
Other TODOs
- performance testing, to ensure that there's no significant performance
penalty.
- should add a specialized version of visibilitymap_clear() for WAL
replay, so that it wouldn't have to rely so much on the fake relcache
entries.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachment: visibilitymap-1.patch (text/x-diff, +782/-119)
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
To modify a page:
If PD_ALL_VISIBLE flag is set, the bit in the visibility map is cleared
first. The heap page is kept pinned, but not locked, while the
visibility map is updated. We want to avoid holding a lock across I/O,
even though the visibility map is likely to stay in cache. After the
visibility map has been updated, the page is exclusively locked and
modified as usual, and PD_ALL_VISIBLE flag is cleared before releasing
the lock.
So after having determined that you will modify a page, you release the
ex lock on the buffer and then try to regain it later? Seems like a
really bad idea from here. What if it's no longer possible to do the
modification you intended?
To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the
page, while you observe that all tuples on the page are visible to everyone.
That doesn't sound too good from a concurrency standpoint...
That's how the patch works right now. However, there's a small
performance problem with the current approach: setting the
PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen:
I'm more concerned about *clearing* the bit being WAL-logged. That's
necessary for correctness.
regards, tom lane
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
One option would be to just ignore that problem for now, and not
WAL-log.
Probably worth skipping for now, since it will cause patch conflicts if
you do. Are there any other interactions with Hot Standby?
But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
to say "page is now all visible", without too much work.
Does the PD_ALL_VISIBLE flag need to be set at the same time as updating
the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive
to set the block flag (unlogged), but just not update VM. Separating the
two concepts should allow the visibility check speed gain to be more
generally available.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
One option would be to just ignore that problem for now, and not
WAL-log.
Probably worth skipping for now, since it will cause patch conflicts if
you do. Are there any other interactions with Hot Standby?
But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
to say "page is now all visible", without too much work.
Hmm. Even if a tuple is visible to everyone on the master, it's not
necessarily yet visible to all the read-only transactions in the slave.
Does the PD_ALL_VISIBLE flag need to be set at the same time as updating
the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive
to set the block flag (unlogged), but just not update VM. Separating the
two concepts should allow the visibility check speed gain to be more
generally available.
Yes, that should be possible in theory. There's no version of
ConditionalLockBuffer() for conditionally upgrading a shared lock to
exclusive, but one could be added. I'm not sure if it would be safe to
set the PD_ALL_VISIBLE flag while holding just a shared lock, though. If
it is, then we could do just that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
... I'm not sure if it would
be safe to set the PD_ALL_VISIBLE_FLAG while holding just a shared lock,
though. If it is, then we could do just that.
Seems like it must be safe. If you have shared lock on a page then no
one else could be modifying the page in a way that would falsify
PD_ALL_VISIBLE. You might have several processes concurrently try to
set the bit but that is safe (same situation as for hint bits).
The harder part is propagating the bit to the visibility map, but I
gather you intend to only allow VACUUM to do that?
regards, tom lane
Tom Lane wrote:
The harder part is propagating the bit to the visibility map, but I
gather you intend to only allow VACUUM to do that?
Yep.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
To modify a page:
If PD_ALL_VISIBLE flag is set, the bit in the visibility map is cleared
first. The heap page is kept pinned, but not locked, while the
visibility map is updated. We want to avoid holding a lock across I/O,
even though the visibility map is likely to stay in cache. After the
visibility map has been updated, the page is exclusively locked and
modified as usual, and PD_ALL_VISIBLE flag is cleared before releasing
the lock.
So after having determined that you will modify a page, you release the
ex lock on the buffer and then try to regain it later? Seems like a
really bad idea from here. What if it's no longer possible to do the
modification you intended?
In case of insert/update, you have to find a new target page. I put the
logic in RelationGetBufferForTuple(). In case of delete and update (old
page), the flag is checked and bit cleared just after pinning the
buffer, before doing anything else. (I note that that's not actually
what the patch is doing for heap_update, will fix..)
If we give up on the strict requirement that the bit in the visibility
map has to be cleared if the PD_ALL_VISIBLE flag on the page is not set,
then we could just update the visibility map after releasing the locks
on the heap pages. I think I'll do that for now, for simplicity.
To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the
page, while you observe that all tuples on the page are visible to everyone.
That doesn't sound too good from a concurrency standpoint...
Well, no, but it's only done in VACUUM, and in pruning. I implemented it
as a new loop that calls HeapTupleSatisfiesVacuum on each tuple,
checking that xmin is old enough for live tuples. But come to think of
it, we're already calling HeapTupleSatisfiesVacuum for every tuple on
the page during VACUUM, so it should be possible to piggyback on that by
restructuring the code.
That's how the patch works right now. However, there's a small
performance problem with the current approach: setting the
PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen:
I'm more concerned about *clearing* the bit being WAL-logged. That's
necessary for correctness.
Yes, clearing the PD_ALL_VISIBLE flag always needs to be WAL-logged.
There's a new boolean field in xl_heap_insert/update/delete records
indicating if the operation cleared the flag. On replay, if the flag was
cleared, the bit in the visibility map is also cleared.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-10-28 at 14:57 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
One option would be to just ignore that problem for now, and not
WAL-log.
Probably worth skipping for now, since it will cause patch conflicts if
you do. Are there any other interactions with Hot Standby?
But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
to say "page is now all visible", without too much work.
Hmm. Even if a tuple is visible to everyone on the master, it's not
necessarily yet visible to all the read-only transactions in the slave.
Never a problem. No query can ever see the rows removed by a cleanup
record, enforced by the recovery system.
Does the PD_ALL_VISIBLE flag need to be set at the same time as updating
the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive
to set the block flag (unlogged), but just not update VM. Separating the
two concepts should allow the visibility check speed gain to be more
generally available.
Yes, that should be possible in theory. There's no version of
ConditionalLockBuffer() for conditionally upgrading a shared lock to
exclusive, but one could be added. I'm not sure if it would be safe to
set the PD_ALL_VISIBLE flag while holding just a shared lock, though. If
it is, then we could do just that.
To be honest, I'm more excited about your perf results for that than I
am about speeding up some VACUUMs.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
On Tue, 2008-10-28 at 14:57 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
One option would be to just ignore that problem for now, and not
WAL-log.
Probably worth skipping for now, since it will cause patch conflicts if
you do. Are there any other interactions with Hot Standby?
But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
to say "page is now all visible", without too much work.
Hmm. Even if a tuple is visible to everyone on the master, it's not
necessarily yet visible to all the read-only transactions in the slave.
Never a problem. No query can ever see the rows removed by a cleanup
record, enforced by the recovery system.
Yes, but there's a problem with recently inserted tuples:
1. A query begins in the slave, taking a snapshot with xmax = 100. So
the effects of anything more recent should not be seen.
2. Transaction 100 inserts a tuple in the master, and commits
3. A vacuum comes along. There's no other transactions running in the
master. Vacuum sees that all tuples on the page, including the one just
inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag.
4. The change is replicated to the slave.
5. The query in the slave that began at step 1 looks at the page, sees
that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility
checks, and erroneously returns the inserted tuple.
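Boiled down, the failure is a snapshot check being skipped. A toy sketch
(a deliberately simplified visibility rule, not the real MVCC code: no
xip array, no subtransactions):

```c
#include <stdbool.h>
#include <stdint.h>

/* A committed xid is visible to a snapshot only if it is older than
 * the snapshot's xmax (simplified). */
static bool xid_visible_to_snapshot(uint32_t xmin, uint32_t snapshot_xmax)
{
    return xmin < snapshot_xmax;
}

/* What a standby scan returns for a tuple, with and without the
 * PD_ALL_VISIBLE shortcut. */
static bool scan_returns_tuple(bool pd_all_visible, uint32_t xmin,
                               uint32_t snapshot_xmax)
{
    if (pd_all_visible)
        return true;                   /* shortcut skips the check */
    return xid_visible_to_snapshot(xmin, snapshot_xmax);
}
```

With xmin = 100 and snapshot xmax = 100, the tuple should be invisible,
but the shortcut makes the scan return it anyway.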
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
Lazy VACUUM only needs to visit pages that are '0' in the visibility
map. This allows partial vacuums, where we only need to scan those parts
of the table that need vacuuming, plus all indexes.
Just realised that this means we still have to visit each block of a
btree index with a cleanup lock.
That means the earlier idea of saying I don't need a cleanup lock if the
page is not in memory makes a lot more sense with a partial vacuum.
1. Scan all blocks in memory for the index (and so, don't do this unless
the index is larger than a certain % of shared buffers),
2. Start reading in new blocks until you've removed the correct number
of tuples
3. Work through the rest of the blocks checking that they are either in
shared buffers and we can get a cleanup lock, or they aren't in shared
buffers and so nobody has them pinned.
If you perform step (2) intelligently with regard to index correlation,
you might not need to do much I/O at all, if any.
(1) has a good hit ratio because mostly only active tables will be
vacuumed so are fairly likely to be in memory.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Tue, 2008-10-28 at 19:02 +0200, Heikki Linnakangas wrote:
Yes, but there's a problem with recently inserted tuples:
1. A query begins in the slave, taking a snapshot with xmax = 100. So
the effects of anything more recent should not be seen.
2. Transaction 100 inserts a tuple in the master, and commits
3. A vacuum comes along. There's no other transactions running in the
master. Vacuum sees that all tuples on the page, including the one just
inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag.
4. The change is replicated to the slave.
5. The query in the slave that began at step 1 looks at the page, sees
that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility
checks, and erroneously returns the inserted tuple.
Yep. I was thinking about FSM and row removal. So PD_ALL_VISIBLE must be
separately settable on the standby. Another reason why it should be able
to be set without a VACUUM - since there will never be one on standby.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
Lazy VACUUM only needs to visit pages that are '0' in the visibility
map. This allows partial vacuums, where we only need to scan those parts
of the table that need vacuuming, plus all indexes.
Just realised that this means we still have to visit each block of a
btree index with a cleanup lock.
Yes, and your proposal cannot fix that. Read "The Deletion Algorithm"
in nbtree/README, particularly the second paragraph.
regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Yes, but there's a problem with recently inserted tuples:
1. A query begins in the slave, taking a snapshot with xmax = 100. So
the effects of anything more recent should not be seen.
2. Transaction 100 inserts a tuple in the master, and commits
3. A vacuum comes along. There's no other transactions running in the
master. Vacuum sees that all tuples on the page, including the one just
inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag.
4. The change is replicated to the slave.
5. The query in the slave that began at step 1 looks at the page, sees
that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility
checks, and erroneously returns the inserted tuple.
But this is exactly equivalent to the problem with recently deleted
tuples: vacuum on the master might take actions that are premature with
respect to the status on the slave. Whatever solution we adopt for that
will work for this too.
regards, tom lane
On Tue, 2008-10-28 at 13:58 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
Lazy VACUUM only needs to visit pages that are '0' in the visibility
map. This allows partial vacuums, where we only need to scan those parts
of the table that need vacuuming, plus all indexes.
Just realised that this means we still have to visit each block of a
btree index with a cleanup lock.
Yes, and your proposal cannot fix that. Read "The Deletion Algorithm"
in nbtree/README, particularly the second paragraph.
Yes, understood. Please read the algorithm again. It does guarantee that
each block in the index has been checked to see if nobody is pinning it,
it just avoids performing I/O to prove that.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Heikki Linnakangas wrote:
Another thing that does need to be fixed is the way that extension
and truncation of the visibility map are handled; that's broken in the
current patch. I started working on the patch a long time ago, before
the FSM rewrite was finished, and haven't gotten around to fixing that
part yet. We already solved it for the FSM, so we could just follow that
pattern. The way we solved truncation in the FSM was to write a separate
WAL record with the new heap size, but perhaps we want to revisit that
decision instead of again adding new code to write a third WAL record
for truncation of the visibility map. smgrtruncate() writes a WAL record
of its own if any full blocks are truncated away from the FSM, but we
needed a WAL record even if no full blocks are truncated from the FSM
file, because the "tail" of the last remaining FSM page, representing
the truncated-away heap pages, still needs to be cleared. The visibility
map has the same problem.
One proposal was to piggyback on the smgrtruncate() WAL record, and call
FreeSpaceMapTruncateRel from smgr_redo(). I considered that ugly from a
modularity point of view; smgr.c shouldn't be calling higher-level
functions. But maybe it wouldn't be that bad, after all. Or, we could
remove WAL-logging from smgrtruncate() altogether, and move it to
RelationTruncate() or another higher-level function, and handle the
WAL-logging and replay there.
In preparation for the visibility map patch, I revisited the truncation
issue, and hacked together a patch to piggyback the FSM truncation onto
the main fork smgr truncation WAL record. I moved the WAL-logging from
smgrtruncate() to RelationTruncate(). There's a new flag to
RelationTruncate indicating whether the FSM should be truncated too, and
only one truncation WAL record is written for the operation.
That does seem cleaner than the current approach where the FSM writes a
separate WAL record just to clear the bits of the last remaining FSM
page. I had to move RelationTruncate() to smgr.c, because I don't think
a function in bufmgr.c should be doing WAL-logging. However,
RelationTruncate really doesn't belong in smgr.c either. Also, now that
smgrtruncate doesn't write its own WAL record, it doesn't seem right for
smgrcreate to be doing that either.
So, I think I'll take this one step forward, and move RelationTruncate()
to a new higher level file, e.g. src/backend/catalog/storage.c, and also
create a new RelationCreateStorage() function that calls smgrcreate(),
and move the WAL-logging from smgrcreate() to RelationCreateStorage().
So, we'll have two functions in a new file:
/* Create physical storage for a relation. If 'fsm' is true, an FSM fork
is also created */
RelationCreateStorage(Relation rel, bool fsm)
/* Truncate the relation to 'nblocks' blocks. If 'fsm' is true, the FSM
is also truncated */
RelationTruncate(Relation rel, BlockNumber nblocks, bool fsm)
The next question is whether the "pending rel deletion" stuff in smgr.c
should be moved to the new file too. It seems like it would belong there
better. That would leave smgr.c as a very thin wrapper around md.c.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
The next question is whether the "pending rel deletion" stuff in smgr.c should
be moved to the new file too. It seems like it would belong there better. That
would leave smgr.c as a very thin wrapper around md.c
Well it's just a switch, albeit with only one case, so I wouldn't expect it to
be much more than a thin wrapper.
If we had more storage systems it might be clearer what features were common
to all of them and could be hoisted up from md.c. I'm not clear there are any
though.
Actually I wonder if an entirely in-memory storage system would help with the
"temporary table" problem on systems where the kernel is too aggressive about
flushing file buffers or metadata.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
Heikki Linnakangas wrote:
So, I think I'll take this one step forward, and move RelationTruncate()
to a new higher level file, e.g. src/backend/catalog/storage.c, and also
create a new RelationCreateStorage() function that calls smgrcreate(),
and move the WAL-logging from smgrcreate() to RelationCreateStorage().
So, we'll have two functions in a new file:
/* Create physical storage for a relation. If 'fsm' is true, an FSM fork
is also created */
RelationCreateStorage(Relation rel, bool fsm)
/* Truncate the relation to 'nblocks' blocks. If 'fsm' is true, the FSM
is also truncated */
RelationTruncate(Relation rel, BlockNumber nblocks, bool fsm)
The next question is whether the "pending rel deletion" stuff in smgr.c
should be moved to the new file too. It seems like it would belong there
better. That would leave smgr.c as a very thin wrapper around md.c
This new approach feels pretty good to me; attached is a patch to do
just that. Many of the functions formerly in smgr.c are now in
src/backend/catalog/storage.c, including all the WAL-logging and pending
rel deletion stuff. I kept their old names for now, though perhaps they
should be renamed now that they're above smgr level.
I also implemented Tom's idea of delaying creation of the FSM until it's
needed, not because of performance, but because it started to get quite
hairy to keep track of which relations should have a FSM and which
shouldn't. Creation of the FSM fork is now treated more like extending a
relation, as a non-WAL-logged operation, and it's up to freespace.c to
create the file when it's needed. There's no operation to explicitly
delete an individual fork of a relation, RelationCreateStorage only
creates the main fork, RelationDropStorage drops all forks, and
RelationTruncate truncates the FSM if and only if the FSM fork exists.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachment: refactor-smgr-create-truncate-1.patch (text/x-diff, +769/-871)
I committed the changes to FSM truncation yesterday; that helps with the
truncation of the visibility map as well. Attached is an updated
visibility map patch.
There are two open issues:
1. The bits in the visibility map are set in the 1st phase of lazy
vacuum. That works, but it means that after a delete or update, it takes
two vacuums until the bit in the visibility map is set. The first vacuum
removes the dead tuple, and only the second sees that there are no dead
tuples and sets the bit.
2. Should modify the output of VACUUM VERBOSE to say how many pages were
actually scanned. What other information is relevant, or no longer
relevant, with partial vacuums?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachment: visibilitymap-2.patch (text/x-diff, +754/-49)
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
I committed the changes to FSM truncation yesterday, that helps with the
truncation of the visibility map as well. Attached is an updated
visibility map patch.
I looked over this patch a bit ...
1. The bits in the visibility map are set in the 1st phase of lazy
vacuum. That works, but it means that after a delete or update, it takes
two vacuums until the bit in the visibility map is set. The first vacuum
removes the dead tuple, and only the second sees that there's no dead
tuples and sets the bit.
I think this is probably not a big issue really. The point of this change
is to optimize things for pages that are static over the long term; one
extra vacuum cycle before the page is deemed static doesn't seem like a
problem. You could even argue that this saves I/O because we don't set
the bit (and perhaps later have to clear it) until we know that the page
has stayed static across a vacuum cycle and thus has a reasonable
probability of continuing to do so.
A possible problem is that if a relation is filled all in one shot,
autovacuum would trigger a single vacuum cycle on it and then never have
a reason to trigger another; leading to the bits never getting set (or
at least not till an antiwraparound vacuum occurs). We might want to
tweak autovac so that an extra vacuum cycle occurs in this case. But
I'm not quite sure what a reasonable heuristic would be.
Some other points:
* ISTM that the patch is designed on the plan that the PD_ALL_VISIBLE
page header flag *must* be correct, but it's really okay if the backing
map bit *isn't* correct --- in particular we don't trust the map bit
when performing antiwraparound vacuums. This isn't well documented.
* Also, I see that vacuum has a provision for clearing an incorrectly
set PD_ALL_VISIBLE flag, but shouldn't it fix the map too?
* It would be good if the visibility map fork were never created until
there is occasion to set a bit in it; this would for instance typically
mean that temp tables would never have one. I think that
visibilitymap.c doesn't get this quite right --- in particular
vm_readbuf seems willing to create/extend the fork whether its extend
argument is true or not, so it looks like an inquiry operation would
cause the map fork to be created. It should be possible to act as
though a nonexistent fork just means "all zeroes".
* heap_insert's all_visible_cleared variable doesn't seem to get
initialized --- didn't your compiler complain?
* You missed updating SizeOfHeapDelete and SizeOfHeapUpdate
regards, tom lane
On Sun, 2008-11-23 at 14:05 -0500, Tom Lane wrote:
A possible problem is that if a relation is filled all in one shot,
autovacuum would trigger a single vacuum cycle on it and then never have
a reason to trigger another; leading to the bits never getting set (or
at least not till an antiwraparound vacuum occurs). We might want to
tweak autovac so that an extra vacuum cycle occurs in this case. But
I'm not quite sure what a reasonable heuristic would be.
This would only be an issue if using the visibility map for things other
than partial vacuum (e.g. index-only scan), right? If we never do
another VACUUM, we don't need partial vacuum.
Regards,
Jeff Davis