crash-safe visibility map, take three
Last week, I posted a couple of possible designs for making the
visibility map crash-safe, which did not elicit much comment. Since
this is an important prerequisite to index-only scans, I'm trying
again.
http://archives.postgresql.org/pgsql-hackers/2010-11/msg01474.php
http://archives.postgresql.org/pgsql-hackers/2010-11/msg01529.php
Here's a third possible design.
Instead of representing each page with a single bit in the visibility
map, use two bits. One bit indicates whether all tuples on the page
are all-visible (same as the current bit) - call this VM_ALL_VISIBLE.
The other bit is only used during VACUUM and indicates whether VACUUM
is trying to set the all-visible bit - call this
VM_BECOMING_ALL_VISIBLE. We observe the rule that any operation that
clears PD_ALL_VISIBLE must clear both the VM_ALL_VISIBLE and
VM_BECOMING_ALL_VISIBLE bits for that page in the visibility map.
VACUUM proceeds as follows:
1. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits
are set, take the exclusive content lock for long enough to clear
them.
2. Scan the heap. When a page is observed to be all-visible, set
VM_BECOMING_ALL_VISIBLE and PD_ALL_VISIBLE.
3. Loop over shared buffers and write out to the OS every page that
belongs to the target relation, was marked all-visible in step 2, and
is still dirty. Note that this may require a WAL flush.
4. fsync() the heap.
5. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits
are set, take the exclusive content lock, clear each such bit, set the
corresponding VM_ALL_VISIBLE bits and XLOG the page.
One might actually want to do steps 2-5 incrementally, in 1GB chunks,
so as not to fsync() too much of the relation at once.
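To make the protocol concrete, here's a toy Python model of the two-bit
scheme (illustrative only - the constants follow the proposal, none of this
is actual backend code). The key property is that a concurrent update
between steps 2 and 5 clears both bits, so the pending bit is never
promoted:

```python
VM_ALL_VISIBLE = 0x1
VM_BECOMING_ALL_VISIBLE = 0x2

class VisibilityMap:
    def __init__(self, npages):
        self.bits = [0] * npages

    def clear_for_update(self, page):
        # Any operation that clears PD_ALL_VISIBLE clears both VM bits.
        self.bits[page] = 0

    def vacuum_mark_becoming(self, page):
        # Step 2: the heap scan found the page all-visible.
        self.bits[page] |= VM_BECOMING_ALL_VISIBLE

    def vacuum_promote(self):
        # Step 5: after the heap fsync, promote surviving pending bits.
        for page, b in enumerate(self.bits):
            if b & VM_BECOMING_ALL_VISIBLE:
                self.bits[page] = VM_ALL_VISIBLE

vm = VisibilityMap(4)
vm.vacuum_mark_becoming(0)
vm.vacuum_mark_becoming(1)
vm.clear_for_update(1)   # concurrent update between steps 2 and 5
vm.vacuum_promote()      # page 0 promoted; page 1's pending bit cancelled
```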
If you tilt your head just right, the recurring problem in all of this
is that the heap page and the visibility map page can go to disk in
either order, and we have no way of knowing which one. A more radical
solution to this problem (and, thus, a fourth possible design) would
be to add a field to the buffer descriptor allowing one page to "wire"
another page into shared buffers. If the value is >0, it's the number
of a buffer it's currently wiring. If the value is <0, it's the
number of other buffers that have wired this buffer. A buffer cannot
both wire another buffer and itself be wired at the same time. If the
value is =0, everything's normal. To make this work, though, you'd
have to complicate the checkpoint logic pretty considerably - make
sure all the unwired buffers are written and fsync'd first, thus
unwiring the remaining ones to be written and fsync'd in a second
pass; and there are also possible problems with very small relations,
where the number of wired buffers might grow to an uncomfortably high
percentage of the total. Waving my hands at all this complexity, you
could then make the heap pages wire the visibility map pages.
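To sketch what the two-pass checkpoint might look like (a toy model, not a
patch - the dict-based buffer descriptors are purely illustrative), pass 1
writes the unwired buffers, which releases their wires on the buffers they
were pinning down for pass 2:

```python
def checkpoint(buffers):
    """buffers: list of dicts {'dirty': bool, 'wires': index or None,
    'wired_by': int}.  Returns the order buffer indices are written in."""
    order = []
    # Pass 1: write (and fsync) every dirty buffer that nothing has wired.
    unwired = [i for i, b in enumerate(buffers)
               if b['dirty'] and b['wired_by'] == 0]
    for i in unwired:
        b = buffers[i]
        order.append(i)
        b['dirty'] = False
        if b['wires'] is not None:
            # Writing a heap page releases its wire on the VM page.
            buffers[b['wires']]['wired_by'] -= 1
            b['wires'] = None
    # Pass 2: the pass-1 writes have unwired the rest; write them now.
    for i, b in enumerate(buffers):
        if b['dirty'] and b['wired_by'] == 0:
            order.append(i)
            b['dirty'] = False
    return order

# Heap buffer 0 wires VM buffer 1, so 0 must reach disk before 1.
bufs = [{'dirty': True, 'wires': 1, 'wired_by': 0},
        {'dirty': True, 'wires': None, 'wired_by': 1}]
write_order = checkpoint(bufs)
```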
I can't say I'm totally in love with any of these designs. Anyone
else have any ideas, or any opinions about which one is best?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 30.11.2010 06:57, Robert Haas wrote:
I can't say I'm totally in love with any of these designs. Anyone
else have any ideas, or any opinions about which one is best?
Well, the design I've been pondering goes like this:
At vacuum:
1. Write an "intent" XLOG record listing a chunk of visibility map bits
that are not currently set, that we are going to try to set. A chunk of
say 100 bits would be about right.
2. Scan the 100 heap pages as we currently do, setting the visibility
map bits as we go.
3. After the scan, lock the visibility map page, check which of the bits
that we set in step 2 are still set (concurrent updates might've cleared
some), and write a final XLOG record listing the set bits. This step
isn't necessary for correctness, BTW, but without it you lose all the
set bits if you crash before the next checkpoint.
At replay, when we see the intent XLOG record, clear all the bits listed
in it. This ensures that if we crashed and some of the visibility map
bits were flushed to disk but the corresponding changes to the heap
pages were not, the bits are cleared. When we see the final XLOG record,
we set the bits.
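A toy model of that replay rule, assuming a simple list of ('intent', bits)
and ('final', bits) records rather than the real XLOG format: crashing
after the intent record but before the final record leaves only cleared
bits, even if the optimistically-set bits made it to disk.

```python
def replay(log, vm_bits):
    """Replay crash recovery over a set of VM bit positions (vm_bits is
    the possibly-inconsistent on-disk state at crash time)."""
    for kind, bits in log:
        if kind == 'intent':
            # Bits we only intended to set may be on disk without the
            # matching heap-page changes having made it: clear them.
            vm_bits -= bits
        elif kind == 'final':
            # These bits were verified still set after the scan: restore.
            vm_bits |= bits
    return vm_bits
```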
Some care is needed with checkpoints. Setting visibility map bits in
step 2 is safe because crash recovery will replay the intent XLOG record
and clear any incorrectly set bits. But if a checkpoint has happened
after the intent XLOG record was written, that's not true. This can be
avoided by checking RedoRecPtr in step 2, and writing a new intent XLOG
record if it has changed since the last intent XLOG record was written.
There's a small race condition in the way a visibility map bit is
currently cleared. When a heap page is updated, it is locked, the update
is WAL-logged, and the lock is released. The visibility map page is
updated only after that. If the final vacuum XLOG record is written just
after updating the heap page, but before the visibility map bit is
cleared, replaying the final XLOG record will set a bit that should not
have been set.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, Nov 29, 2010 at 9:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
1. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits
are set, take the exclusive content lock for long enough to clear
them.
I wonder what the performance hit will be to workloads with contention
and if this feature should be optional.
--
Rob Wultsch
wultsch@gmail.com
On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Some care is needed with checkpoints. Setting visibility map bits in step 2
is safe because crash recovery will replay the intent XLOG record and clear
any incorrectly set bits. But if a checkpoint has happened after the intent
XLOG record was written, that's not true. This can be avoided by checking
RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed
since the last intent XLOG record was written.
It seems like you'll need to hold some kind of lock between the time
you examine RedoRecPtr and the time you actually examine the bit.
WALInsertLock in shared mode, maybe?
There's a small race condition in the way a visibility map bit is currently
cleared. When a heap page is updated, it is locked, the update is
WAL-logged, and the lock is released. The visibility map page is updated
only after that. If the final vacuum XLOG record is written just after
updating the heap page, but before the visibility map bit is cleared,
replaying the final XLOG record will set a bit that should not have been
set.
Well, if that final XLOG record isn't necessary for correctness
anyway, the obvious thing to do seems to be - don't write it. Crashes
are not so common that losing even a full hour's visibility map bits
when one occurs seems worth killing ourselves over. And
not everybody sets checkpoint_timeout to an hour, and not all
checkpoints are triggered by checkpoint_timeout, and not all crashes
happen just before it expires. Seems like we might be better off
writing that much less WAL.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 30.11.2010 06:57, Robert Haas wrote:
I can't say I'm totally in love with any of these designs. Anyone
else have any ideas, or any opinions about which one is best?
Well, the design I've been pondering goes like this:
Wouldn't it be easier and more robust to just consider VM bit changes to
be part of the WAL-logged actions? That would include updating LSNs on
VM pages and flushing VM pages to disk during checkpoint based on their
LSN values. All of these other schemes seem too complicated and not
provably correct.
Of course, that'd mean doing the bit changes inside the critical
sections for the related actions, so it's not a trivial change
code-wise, but neither are these other ideas.
regards, tom lane
On 30.11.2010 17:32, Robert Haas wrote:
On Tue, Nov 30, 2010 at 2:34 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Some care is needed with checkpoints. Setting visibility map bits in step 2
is safe because crash recovery will replay the intent XLOG record and clear
any incorrectly set bits. But if a checkpoint has happened after the intent
XLOG record was written, that's not true. This can be avoided by checking
RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed
since the last intent XLOG record was written.
It seems like you'll need to hold some kind of lock between the time
you examine RedoRecPtr and the time you actually examine the bit.
WALInsertLock in shared mode, maybe?
It's enough to hold an exclusive lock on the visibility map page. You
have to set the bit first, and then check RedoRecPtr, and if it changed,
write the XLOG record before releasing the lock. If RedoRecPtr changes
any time before we check RedoRecPtr, we'll write the XLOG record so
we're safe. If it changes after that, we're safe because the checkpoint
will flush the updated heap page and visibility map page.
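To spell out that ordering (a mock, of course - RedoRecPtr here is just an
integer, and the VM page lock is assumed held around the whole function):
the pointer is re-read *after* the bit is set, and a changed pointer forces
a fresh intent record before the lock is released.

```python
class MockWAL:
    def __init__(self):
        self.redo_rec_ptr = 100   # advances when a checkpoint starts
        self.records = []

def set_vm_bit(wal, vm, page, last_intent_redo_ptr):
    # (exclusive content lock on the VM page assumed held throughout)
    vm[page] = True                              # 1. set the bit first
    if wal.redo_rec_ptr != last_intent_redo_ptr:
        # 2. a checkpoint began since the last intent record was written:
        #    emit a fresh intent record before releasing the lock.
        wal.records.append(('intent', page))
        return wal.redo_rec_ptr
    return last_intent_redo_ptr

wal = MockWAL()
vm = [False] * 4
ptr = set_vm_bit(wal, vm, 0, 100)   # no checkpoint: no new record needed
wal.redo_rec_ptr = 200              # a checkpoint begins
ptr = set_vm_bit(wal, vm, 1, ptr)   # now a new intent record is written
```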
There's a small race condition in the way a visibility map bit is currently
cleared. When a heap page is updated, it is locked, the update is
WAL-logged, and the lock is released. The visibility map page is updated
only after that. If the final vacuum XLOG record is written just after
updating the heap page, but before the visibility map bit is cleared,
replaying the final XLOG record will set a bit that should not have been
set.
Well, if that final XLOG record isn't necessary for correctness
anyway, the obvious thing to do seems to be - don't write it. Crashes
are not so common that loss of even a full hour's visibility map bits
in the event that we have one seems worth killing ourselves over. And
not everybody sets checkpoint_timeout to an hour, and not all
checkpoints are triggered by checkpoint_timeout, and not all crashes
happen just before it expires. Seems like we might be better off
writing that much less WAL.
Yeah, possibly. It also means that the set bits will not propagate to
standby servers, though.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Nov 30, 2010 at 10:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 30.11.2010 06:57, Robert Haas wrote:
I can't say I'm totally in love with any of these designs. Anyone
else have any ideas, or any opinions about which one is best?
Well, the design I've been pondering goes like this:
Wouldn't it be easier and more robust to just consider VM bit changes to
be part of the WAL-logged actions? That would include updating LSNs on
VM pages and flushing VM pages to disk during checkpoint based on their
LSN values. All of these other schemes seem too complicated and not
provably correct.
What WAL-logged actions?
The problem case is where a page has no tuples or line pointers that
need to be removed, and all we need to do is mark it all-visible. We
don't currently WAL-log anything in that case.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 30.11.2010 17:38, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
On 30.11.2010 06:57, Robert Haas wrote:
I can't say I'm totally in love with any of these designs. Anyone
else have any ideas, or any opinions about which one is best?
Well, the design I've been pondering goes like this:
Wouldn't it be easier and more robust to just consider VM bit changes to
be part of the WAL-logged actions? That would include updating LSNs on
VM pages and flushing VM pages to disk during checkpoint based on their
LSN values. All of these other schemes seem too complicated and not
provably correct.
The VM bit can be set once all the tuples on the page become visible to
everyone. There is no WAL-logged action at that point that we could
piggyback on.
Clearing the bit is already handled like that - replay of heap
insert/update/delete records clear the visibility map bit.
Of course, that'd mean doing the bit changes inside the critical
sections for the related actions, so it's not a trivial change
code-wise, but neither are these other ideas.
Yeah, I'm not terribly excited about any of these schemes. The "intent"
record seems like the simplest one, but even that is so different from
the traditional WAL-logging we do that it makes me slightly nervous.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Nov 30, 2010 at 10:43 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
It seems like you'll need to hold some kind of lock between the time
you examine RedoRecPtr and the time you actually examine the bit.
WALInsertLock in shared mode, maybe?
It's enough to hold an exclusive lock on the visibility map page. You have
to set the bit first, and then check RedoRecPtr, and if it changed, write
the XLOG record before releasing the lock. If RedoRecPtr changes any time
before we check RedoRecPtr, we'll write the XLOG record so we're safe. If it
changes after that, we're safe because the checkpoint will flush the updated
heap page and visibility map page.
Brilliant. I assume that we need to call GetRedoRecPtr() after taking
the exclusive lock on the page, though?
Yeah, possibly. It also means that the set bits will not propagate to
standby servers, though.
That's definitely sucky, but in some ways it would be more complicated
if they did, because I don't think all-visible on the master implies
all-visible on the standby.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Here's one more idea:
The trivial solution to this is to WAL-log setting the visibility map
bit, like we WAL-log any other operation. Lock the heap page, lock the
visibility map page, write WAL-record, and release locks. That works,
but the problem is that it creates quite a lot of new WAL traffic.
We could reduce the WAL traffic by simply updating multiple pages at a
time. Lock N pages, lock the visibility map page, write one WAL record,
and release locks. If N=10, for example, we only need to WAL-log a
couple of bytes per page, so the WAL volume should be acceptable. The
downside is that you need to keep more pages locked at the same time,
but maybe that's not too bad.
This wouldn't require anything special, which means fewer hard-to-debug
visibility & recovery bugs.
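Rough arithmetic for the batching (illustrative numbers only - assuming a
few dozen bytes of fixed per-record overhead plus a 4-byte block number
per page covered by the record):

```python
def wal_bytes_per_page(n, header=32, per_page=4):
    """Amortized WAL bytes per heap page when one record covers n pages."""
    return (header + n * per_page) / n

single = wal_bytes_per_page(1)    # one record per page: 36.0 bytes/page
batched = wal_bytes_per_page(10)  # N=10 batching: 7.2 bytes/page
```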
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 30.11.2010 17:38, Tom Lane wrote:
Wouldn't it be easier and more robust to just consider VM bit changes to
be part of the WAL-logged actions? That would include updating LSNs on
VM pages and flushing VM pages to disk during checkpoint based on their
LSN values. All of these other schemes seem too complicated and not
provably correct.
The vm bit can be set once all the tuples on the page become visible to
everyone. There is no WAL-logged action at that point we could piggyback on.
So you start emitting a WAL entry for the act of setting the VM bit
(and I guess the page header hint bit would be included in that too).
Yeah, I'm not terribly excited about any of these schemes. The "intent"
record seems like the simplest one, but even that is so different from
the traditional WAL-logging we do that it makes me slightly nervous.
I'm not convinced it works at all. Consider write intent record,
checkpoint, set bit, crash before completing vacuum. There will be
no second intent record at which you could clean up if things are
inconsistent.
regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
The trivial solution to this is to WAL-log setting the visibility map
bit, like we WAL-log any other operation. Lock the heap page, lock the
visibility map page, write WAL-record, and release locks. That works,
but the problem is that it creates quite a lot of new WAL traffic.
How much is "quite a lot"? Do we have any real reason to think that
this solution is unacceptable performance-wise?
I'd also suggest that if you want to prevent torn-page syndrome on VM
pages (and if you want to rely on their LSN values, you do) then you
probably don't have any choice anyway. VM pages will have to adhere to
the same write-full-page-on-first-mod-after-checkpoint rule as any other
page. I'd guess that this will swamp any savings from cutesy schemes
for reducing the number of WAL records.
We could reduce the WAL traffic by simply updating multiple pages at a
time. Lock N pages, lock the visibility map page, write one WAL record,
and release locks.
I don't think that will work, because you have to hold the lock on a
page from the time you check that it's all-visible to the time you apply
the update. The loss of concurrency against updates would be pretty
bad, and I think you'd be creating significant risk of deadlocks from
holding multiple buffer locks at once.
regards, tom lane
On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
How much is "quite a lot"? Do we have any real reason to think that
this solution is unacceptable performance-wise?
Well, let's imagine a 1GB insert-only table. It has 128K pages. If
you XLOG setting the bit on each page, you'll need to write 128K WAL
records, each containing a 12-byte relfilenode and a 4-byte block
offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.
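Checking that arithmetic, with the default 8KB page size:

```python
GB = 1024 ** 3
BLCKSZ = 8192                   # default PostgreSQL heap page size
pages = GB // BLCKSZ            # 131072 pages, i.e. 128K
wal_payload = pages * (12 + 4)  # 16 bytes of WAL payload per page -> 2MB
```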
But you did just dirty a gigabyte of data.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 30, 2010 at 11:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
How much is "quite a lot"? Do we have any real reason to think that
this solution is unacceptable performance-wise?
Well, let's imagine a 1GB insert-only table. It has 128K pages. If
you XLOG setting the bit on each page, you'll need to write 128K WAL
records, each containing a 12-byte relfilenode and a 4-byte block
offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.
But you did just dirty a gigabyte of data.
Oh, but it's worse than that. When you XLOG a WAL record for each of
those pages, you're going to trigger full-page writes for all of them.
So now you've turned 1GB of data to write into 2+ GB of data to
write.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 30.11.2010 18:22, Robert Haas wrote:
On Tue, Nov 30, 2010 at 11:16 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
How much is "quite a lot"? Do we have any real reason to think that
this solution is unacceptable performance-wise?
Well, let's imagine a 1GB insert-only table. It has 128K pages. If
you XLOG setting the bit on each page, you'll need to write 128K WAL
records, each containing a 12-byte relfilenode and a 4-byte block
offset, for a total of 16 bytes of WAL per page, thus 2MB of WAL.
Plus WAL headers, I think it's something like 32 or 40 bytes of WAL per
page.
But you did just dirty a gigabyte of data.
Good point.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 30.11.2010 18:10, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
Yeah, I'm not terribly excited about any of these schemes. The "intent"
record seems like the simplest one, but even that is so different from
the traditional WAL-logging we do that it makes me slightly nervous.
I'm not convinced it works at all. Consider write intent record,
checkpoint, set bit, crash before completing vacuum. There will be
no second intent record at which you could clean up if things are
inconsistent.
That's why you need to check the RedoRecPtr when you set the bit. If it
has changed, ie. a checkpoint has happened, the set bit step will write
a new intent record.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
Oh, but it's worse than that. When you XLOG a WAL record for each of
those pages, you're going to trigger full-page writes for all of them.
So now you've turned 1GB of data to write into 2+ GB of data to
write.
No, because only the first mod of each VM page would trigger a full page
write, at least assuming a reasonable ordering of the operations.
regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 30.11.2010 18:10, Tom Lane wrote:
I'm not convinced it works at all. Consider write intent record,
checkpoint, set bit, crash before completing vacuum. There will be
no second intent record at which you could clean up if things are
inconsistent.
That's why you need to check the RedoRecPtr when you set the bit. If it
has changed, ie. a checkpoint has happened, the set bit step will write
a new intent record.
Oh, you explained the proposal poorly then. I thought you meant recheck
and write another intent record just once, immediately before sending
the final xlog record.
It still seems rickety and not clearly correct, especially when you
start thinking about all the other constraints we have on xlog behavior
(eg, does this work while taking a base backup).
regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
That's definitely sucky, but in some ways it would be more complicated
if they did, because I don't think all-visible on the master implies
all-visible on the standby.
Ouch. That seems like it could shoot down all these proposals. There
definitely isn't any way to make VM crash-safe if there is no WAL-driven
mechanism for setting the bits.
I guess what we need is a way to delay the application of such a WAL
record on the slave until it's safe, which means the record also has to
carry some indication of the youngest XMIN on the page.
regards, tom lane
On Tue, Nov 30, 2010 at 11:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Oh, but it's worse than that. When you XLOG a WAL record for each of
those pages, you're going to trigger full-page writes for all of them.
So now you've turned 1GB of data to write into 2+ GB of data to
write.
No, because only the first mod of each VM page would trigger a full page
write, at least assuming a reasonable ordering of the operations.
I'm not worried about the full-page writes from updating the
visibility map - I'm worried about the full-page writes from updating
the heap. It doesn't matter a whit if we fail to set a bit in the
visibility map. What matters is if we DO set the bit in the visibility
map but FAIL TO set the bit in the heap, because then a subsequent
update to the heap page won't check the visibility map and clear the
bit. The *heap* updates are the ones that have to be guaranteed to
make it to disk.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company