Race condition between hot standby and restoring a FPW
There's a race condition between a backend running queries in hot
standby mode, and restoring a full-page image from a WAL record. It's
present in all supported versions.
RestoreBackupBlockContents does this:
buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
RBM_ZERO);
Assert(BufferIsValid(buffer));
if (get_cleanup_lock)
LockBufferForCleanup(buffer);
else
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
If the page is not in buffer cache yet, and a backend reads and locks
the buffer after the above XLogReadBufferExtended call has zeroed it,
but before it has locked it, the backend sees an empty page.
The principle of fixing that is straightforward: the zeroed page should
not be visible to others, even momentarily. Unfortunately, I think
that's going to require an API change to ReadBufferExtended(RBM_ZERO) :-(.
I can think of two ways to fix this:
1. Have ReadBufferExtended lock the page in RBM_ZERO mode, before
returning it. That makes the API inconsistent, as the function would
sometimes lock the page, and sometimes not.
2. When ReadBufferExtended doesn't find the page in cache, it returns
the buffer in !BM_VALID state (i.e. still in I/O in-progress state).
Require the caller to call a second function, after locking the page, to
finish the I/O.
Anyone have a better idea?
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
There's a race condition between a backend running queries in hot
standby mode, and restoring a full-page image from a WAL record. It's
present in all supported versions.
I can think of two ways to fix this:
1. Have ReadBufferExtended lock the page in RBM_ZERO mode, before
returning it. That makes the API inconsistent, as the function would
sometimes lock the page, and sometimes not.
Ugh.
2. When ReadBufferExtended doesn't find the page in cache, it returns
the buffer in !BM_VALID state (i.e. still in I/O in-progress state).
Require the caller to call a second function, after locking the page, to
finish the I/O.
Not great either. What about an RBM_NOERROR mode that is like RBM_ZERO
in terms of handling error conditions, but does not forcibly zero the page
if it's already valid?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Nov 12, 2014 at 7:39 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
2. When ReadBufferExtended doesn't find the page in cache, it returns the
buffer in !BM_VALID state (i.e. still in I/O in-progress state). Require the
caller to call a second function, after locking the page, to finish the I/O.
This seems like a reasonable approach.
If you tilt your head the right way, zeroing a page and restoring a
backup block are the same thing: either way, you want to "read" the
block into shared buffers without actually reading it, so that you can
overwrite the prior contents with something else. So, you could fix
this by adding a new mode, RBM_OVERWRITE, and passing the new page
contents as an additional argument to ReadBufferExtended, which would
then memcpy() that data into place where RBM_ZERO calls MemSet() to
zero it.
I'm not sure whether we want to complicate the API that way, though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 11/12/2014 04:56 PM, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
There's a race condition between a backend running queries in hot
standby mode, and restoring a full-page image from a WAL record. It's
present in all supported versions.I can think of two ways to fix this:
1. Have ReadBufferExtended lock the page in RBM_ZERO mode, before
returning it. That makes the API inconsistent, as the function would
sometimes lock the page, and sometimes not.Ugh.
2. When ReadBufferExtended doesn't find the page in cache, it returns
the buffer in !BM_VALID state (i.e. still in I/O in-progress state).
Require the caller to call a second function, after locking the page, to
finish the I/O.Not great either. What about an RBM_NOERROR mode that is like RBM_ZERO
in terms of handling error conditions, but does not forcibly zero the page
if it's already valid?
Isn't that exactly what RBM_ZERO_ONERROR does?
Anyway, you don't want to read the page from disk, just to check if it's
already valid. We stopped doing that in 8.2 (commit
8c3cc86e7b688b0efe5ec6ce4f4342c2883b1db5), and it gave a big speedup to
recovery.
(Note that when the page is already in the buffer-cache, RBM_ZERO
already doesn't zero the page. So this race condition only happens when
the page isn't in the buffer cache yet).
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 11/12/2014 04:56 PM, Tom Lane wrote:
Not great either. What about an RBM_NOERROR mode that is like RBM_ZERO
in terms of handling error conditions, but does not forcibly zero the page
if it's already valid?
Anyway, you don't want to read the page from disk, just to check if it's
already valid.
Oh, good point.
(Note that when the page is already in the buffer-cache, RBM_ZERO
already doesn't zero the page. So this race condition only happens when
the page isn't in the buffer cache yet).
Right.
On reconsideration I think the "RBM_ZERO returns page already locked"
alternative may be the less ugly. That has the advantage that any code
that doesn't get updated will fail clearly and reliably. With the other
thing, you need some additional back channel for the caller to know
whether it must complete the I/O, and it's not obvious what will happen
if it fails to do that (except that it will be bad).
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 11/12/2014 05:08 PM, Robert Haas wrote:
On Wed, Nov 12, 2014 at 7:39 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:2. When ReadBufferExtended doesn't find the page in cache, it returns the
buffer in !BM_VALID state (i.e. still in I/O in-progress state). Require the
caller to call a second function, after locking the page, to finish the I/O.This seems like a reasonable approach.
If you tilt your head the right way, zeroing a page and restoring a
backup block are the same thing: either way, you want to "read" the
block into shared buffers without actually reading it, so that you can
overwrite the prior contents with something else. So, you could fix
this by adding a new mode, RBM_OVERWRITE, and passing the new page
contents as an additional argument to ReadBufferExtended, which would
then memcpy() that data into place where RBM_ZERO calls MemSet() to
zero it.
Yes, that would be quite a clean API. However, there's a problem with
locking, when the redo routine modifies multiple pages. Currently, you
lock the page first, and replace the page with the new contents while
holding the lock. With RBM_OVERWRITE, the new page contents would sneak
into the buffer before RestoreBackupBlock has acquired the lock on the
page, and another backend might pin and lock the page before
RestoreBackupBlock does. The page contents would be valid, but they
might not be consistent with other buffers yet. The redo routine might
be doing an atomic operation that spans multiple pages, by holding the
locks on all the pages until it's finished with all the changes, but the
backend would see a partial result.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 11/12/2014 05:20 PM, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 11/12/2014 04:56 PM, Tom Lane wrote:
Not great either. What about an RBM_NOERROR mode that is like RBM_ZERO
in terms of handling error conditions, but does not forcibly zero the page
if it's already valid?Anyway, you don't want to read the page from disk, just to check if it's
already valid.Oh, good point.
(Note that when the page is already in the buffer-cache, RBM_ZERO
already doesn't zero the page. So this race condition only happens when
the page isn't in the buffer cache yet).Right.
On reconsideration I think the "RBM_ZERO returns page already locked"
alternative may be the less ugly. That has the advantage that any code
that doesn't get updated will fail clearly and reliably.
Yeah, I'm leaning to that approach as well. It's made more ugly by the
fact that you sometimes need a cleanup lock on the buffer, so the caller
will somehow need to specify whether to get a cleanup lock or a normal
exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK.
Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code
that's not updated to break even more obviously, at compile-time.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 11/12/2014 05:20 PM, Tom Lane wrote:
On reconsideration I think the "RBM_ZERO returns page already locked"
alternative may be the less ugly. That has the advantage that any code
that doesn't get updated will fail clearly and reliably.
Yeah, I'm leaning to that approach as well. It's made more ugly by the
fact that you sometimes need a cleanup lock on the buffer, so the caller
will somehow need to specify whether to get a cleanup lock or a normal
exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK.
Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code
that's not updated to break even more obviously, at compile-time.
Yeah, I was considering suggesting changing the name of the mode too.
+1 for solving the lock-type problem with 2 modes.
(You could also consider leaving RBM_ZERO in place with its current
semantics, but I think what you've shown here is that there is no
safe way to use it, so probably that's not what we should do.)
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 11/12/14, 9:47 AM, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 11/12/2014 05:20 PM, Tom Lane wrote:
On reconsideration I think the "RBM_ZERO returns page already locked"
alternative may be the less ugly. That has the advantage that any code
that doesn't get updated will fail clearly and reliably.Yeah, I'm leaning to that approach as well. It's made more ugly by the
fact that you sometimes need a cleanup lock on the buffer, so the caller
will somehow need to specify whether to get a cleanup lock or a normal
exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK.
Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code
that's not updated to break even more obviously, at compile-time.Yeah, I was considering suggesting changing the name of the mode too.
+1 for solving the lock-type problem with 2 modes.(You could also consider leaving RBM_ZERO in place with its current
semantics, but I think what you've shown here is that there is no
safe way to use it, so probably that's not what we should do.)
If we're tweaking modes, can we avoid zeroing the buffer if we're about to dump a full page image into it? Presumably that means we'd have to keep the page locked until the image is written...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 11/13/2014 01:06 AM, Jim Nasby wrote:
On 11/12/14, 9:47 AM, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 11/12/2014 05:20 PM, Tom Lane wrote:
On reconsideration I think the "RBM_ZERO returns page already locked"
alternative may be the less ugly. That has the advantage that any code
that doesn't get updated will fail clearly and reliably.Yeah, I'm leaning to that approach as well. It's made more ugly by the
fact that you sometimes need a cleanup lock on the buffer, so the caller
will somehow need to specify whether to get a cleanup lock or a normal
exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK.
Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code
that's not updated to break even more obviously, at compile-time.Yeah, I was considering suggesting changing the name of the mode too.
+1 for solving the lock-type problem with 2 modes.(You could also consider leaving RBM_ZERO in place with its current
semantics, but I think what you've shown here is that there is no
safe way to use it, so probably that's not what we should do.)
Committed that way. I left a placeholder for RBM_ZERO in back-branches,
so 3rd party extensions using it continue to work - to the extent that
they did before - without recompiling. In theory, it's possible that
someone is using RBM_ZERO while holding an AccessExclusiveLock on the
relation, or there's something else that ensures that no-one else is
looking at the relation. I doubt there actually are any extensions out
there using it, but might as well.
If we're tweaking modes, can we avoid zeroing the buffer if we're about to dump a full page image into it? Presumably that means we'd have to keep the page locked until the image is written...
Hmm. I suppose we could, as long as the caller knows that is has to
re-initialize the page. That would probably be ok for all current
callers. I didn't go ahead with that, however; the zeroing isn't that
expensive and we would need to rename the mode if it didn't zero the
page anymore, which would cause some code churn.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers