BUG #4879: bgwriter fails to fsync the file in recovery mode
The following bug has been logged online:
Bug reference: 4879
Logged by: Fujii Masao
Email address: masao.fujii@gmail.com
PostgreSQL version: 8.4dev
Operating system: RHEL5.1 x86_64
Description: bgwriter fails to fsync the file in recovery mode
Details:
The restartpoint by bgwriter in recovery mode caused the following error.
ERROR: could not fsync segment 0 of relation base/11564/16422_fsm: No
such file or directory
The following procedure can reproduce this error.
(1) create warm-standby environment
(2) execute "pgbench -i -s10"
(3) execute the following SQLs
TRUNCATE pgbench_accounts ;
TRUNCATE pgbench_branches ;
TRUNCATE pgbench_history ;
TRUNCATE pgbench_tellers ;
CHECKPOINT ;
SELECT pg_switch_xlog();
(4) wait a minute, then the upcoming restartpoint would cause the error
in the standby server.
Whether this error happens or not depends on the timing of operations.
So, you might need to repeat the procedure (2) and (3) in order to
reproduce the error.
I suspect that the cause of this error is the race condition between
file deletion by startup process and fsync by bgwriter: TRUNCATE xlog
record immediately deletes the corresponding file, while it might be
scheduled to be fsynced by bgwriter. We should leave the actual file
deletion to bgwriter instead of startup process, like normal mode?
On Thu, 2009-06-25 at 12:55 +0000, Fujii Masao wrote:
The following bug has been logged online:
Bug reference: 4879
Logged by: Fujii Masao
Email address: masao.fujii@gmail.com
PostgreSQL version: 8.4dev
Operating system: RHEL5.1 x86_64
Description: bgwriter fails to fsync the file in recovery mode
Details:
Looking at it now.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
On Thu, 2009-06-25 at 12:55 +0000, Fujii Masao wrote:
The following bug has been logged online:
Bug reference: 4879
Logged by: Fujii Masao
Email address: masao.fujii@gmail.com
PostgreSQL version: 8.4dev
Operating system: RHEL5.1 x86_64
Description: bgwriter fails to fsync the file in recovery mode
Details:Looking at it now.
Thanks.
I suspect that the cause of this error is the race condition between
file deletion by startup process and fsync by bgwriter: TRUNCATE xlog
record immediately deletes the corresponding file, while it might be
scheduled to be fsynced by bgwriter. We should leave the actual file
deletion to bgwriter instead of startup process, like normal mode?
I think the real problem is this in mdunlink():
/* Register request to unlink first segment later */
if (!isRedo && forkNum == MAIN_FORKNUM)
register_unlink(rnode);
When we replay the unlink of the relation, we don't te bgwriter about
it. Normally we do, so bgwriter knows that if the fsync() fails with
ENOENT, it's ok since the file was deleted.
It's tempting to just remove the "!isRedo" condition, but then we have
another problem: if bgwriter hasn't been started yet, and the shmem
queue is full, we get stuck in register_unlink() trying to send the
message and failing.
In archive recovery, we always start bgwriter at the beginning of WAL
replay. In crash recovery, we don't start bgwriter until the end of wAL
replay. So we could change the "!isRedo" condition to
"!InArchiveRecovery". It's not a very clean solution, but it's simple.
Hmm, what happens when the startup process performs a write, and
bgwriter is not running? Do the fsync requests queue up in the shmem
queue until the end of recovery when bgwriter is launched? I guess I'll
have to try it out...
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
Hmm, what happens when the startup process performs a write, and
bgwriter is not running? Do the fsync requests queue up in the shmem
queue until the end of recovery when bgwriter is launched? I guess I'll
have to try it out...
Oh dear, doesn't look good. The startup process has a pendingOpsTable of
its own. bgwriter won't fsync() files that the startup process has
written itself. That needs to be fixed, or you can lose data when an
archive recovery crashes after a restartpoint.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
In archive recovery, we always start bgwriter at the beginning of WAL
replay. In crash recovery, we don't start bgwriter until the end of wAL
replay. So we could change the "!isRedo" condition to
"!InArchiveRecovery". It's not a very clean solution, but it's simple.
This is probably what is needed. We need to look around for other tests
of "in redo" that have been obsoleted by the change in bgwriter
behavior.
regards, tom lane
On Thu, 2009-06-25 at 12:55 +0000, Fujii Masao wrote:
The restartpoint by bgwriter in recovery mode caused the following error.
ERROR: could not fsync segment 0 of relation base/11564/16422_fsm: No
such file or directory
I think I see a related bug also.
register_dirty_segment() run by Startup process doesn't differentiate
correctly whether the bgwriter is active. md.c line 1194. So it will
continue to register fsync requests into its own private pendingOpsTable
when it should be forwarding them to bgwriter. When bgwriter runs
mdsync() the contents of startup's pendingOpsTable will be ignored.
That looks to me to be a serious bug.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2009-06-25 at 17:35 +0300, Heikki Linnakangas wrote:
Heikki Linnakangas wrote:
Hmm, what happens when the startup process performs a write, and
bgwriter is not running? Do the fsync requests queue up in the shmem
queue until the end of recovery when bgwriter is launched? I guess I'll
have to try it out...Oh dear, doesn't look good. The startup process has a pendingOpsTable of
its own. bgwriter won't fsync() files that the startup process has
written itself. That needs to be fixed, or you can lose data when an
archive recovery crashes after a restartpoint.
Yes, that's what I see also. Patch attached.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Attachments:
forwardsfsyncrequests.v1.patchtext/x-patch; charset=utf-8; name=forwardsfsyncrequests.v1.patchDownload+31-12
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
In archive recovery, we always start bgwriter at the beginning of WAL
replay. In crash recovery, we don't start bgwriter until the end of wAL
replay. So we could change the "!isRedo" condition to
"!InArchiveRecovery". It's not a very clean solution, but it's simple.This is probably what is needed. We need to look around for other tests
of "in redo" that have been obsoleted by the change in bgwriter
behavior.
We have another problem with the end-of-recovery checkpoint. When the
startup process does the checkpoint, it won't know to perform the
pending fsyncs() the bgwriter has absorbed.
A short fix would be to have bgwriter do the shutdown checkpoint instead
in archive recovery. I don't recall if there was a reason it wasn't
coded like that to begin with, though.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Heikki Linnakangas wrote:
Hmm, what happens when the startup process performs a write, and
bgwriter is not running? Do the fsync requests queue up in the shmem
queue until the end of recovery when bgwriter is launched? I guess I'll
have to try it out...
Oh dear, doesn't look good. The startup process has a pendingOpsTable of
its own. bgwriter won't fsync() files that the startup process has
written itself. That needs to be fixed, or you can lose data when an
archive recovery crashes after a restartpoint.
Ouch. I'm beginning to think that the best thing is to temporarily
revert the change that made bgwriter active during recovery. It's
obviously not been adequately thought through or tested.
regards, tom lane
On Thu, 2009-06-25 at 17:02 +0300, Heikki Linnakangas wrote:
I think the real problem is this in mdunlink():
/* Register request to unlink first segment later */
if (!isRedo && forkNum == MAIN_FORKNUM)
register_unlink(rnode);When we replay the unlink of the relation, we don't te bgwriter about
it. Normally we do, so bgwriter knows that if the fsync() fails with
ENOENT, it's ok since the file was deleted.It's tempting to just remove the "!isRedo" condition, but then we have
another problem: if bgwriter hasn't been started yet, and the shmem
queue is full, we get stuck in register_unlink() trying to send the
message and failing.In archive recovery, we always start bgwriter at the beginning of WAL
replay. In crash recovery, we don't start bgwriter until the end of wAL
replay. So we could change the "!isRedo" condition to
"!InArchiveRecovery". It's not a very clean solution, but it's simple.
That seems to work for me, though I have some doubts as to the way two
phase commit is coded. 2PC seems to assume that if a file still exists
we must be in recovery and its OK to ignore.
Clean? We've changed the conditions under which the unlink needs to be
registered and !InArchiveRecovery defines the changed conditions
precisely.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Heikki Linnakangas wrote:
Hmm, what happens when the startup process performs a write, and
bgwriter is not running? Do the fsync requests queue up in the shmem
queue until the end of recovery when bgwriter is launched? I guess I'll
have to try it out...Oh dear, doesn't look good. The startup process has a pendingOpsTable of
its own. bgwriter won't fsync() files that the startup process has
written itself. That needs to be fixed, or you can lose data when an
archive recovery crashes after a restartpoint.Ouch. I'm beginning to think that the best thing is to temporarily
revert the change that made bgwriter active during recovery. It's
obviously not been adequately thought through or tested.
That was my first thought too, but unfortunately we now rely on bgwriter
to perform restartpoints :-(.
I came up with the attached patch, which includes Simon's patch to have
all fsync requests forwarded to bgwriter during archive recovery. To fix
the startup checkpoint issue, startup process requests a forced
restartpoint, which will flush any fsync requests bgwriter has
accumulated, before doing the actual checkpoint in the startup process.
This is completely untested still, but does anyone immediately see any
more problems?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
fix-archive-recovery-pendingOpsTable-confusion-1.patchtext/x-diff; name=fix-archive-recovery-pendingOpsTable-confusion-1.patchDownload+37-22
On Thu, 2009-06-25 at 11:31 -0400, Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Heikki Linnakangas wrote:
Hmm, what happens when the startup process performs a write, and
bgwriter is not running? Do the fsync requests queue up in the shmem
queue until the end of recovery when bgwriter is launched? I guess I'll
have to try it out...Oh dear, doesn't look good. The startup process has a pendingOpsTable of
its own. bgwriter won't fsync() files that the startup process has
written itself. That needs to be fixed, or you can lose data when an
archive recovery crashes after a restartpoint.Ouch. I'm beginning to think that the best thing is to temporarily
revert the change that made bgwriter active during recovery.
That seems the safest course, to avoid derailing the schedule.
Please define "temporarily". Will it be re-enabled in 8.4.1, assuming we
find an acceptable fix?
It's
obviously not been adequately thought through or tested.
Hmmm...
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2009-06-25 at 18:12 +0300, Heikki Linnakangas wrote:
A short fix would be to have bgwriter do the shutdown checkpoint instead
in archive recovery. I don't recall if there was a reason it wasn't
coded like that to begin with, though.
I think the problem was that it was coded both ways at different stages
of patch evolution. The decision to retain the shutdown checkpoint by
the startup process was taken in January, IIRC.
Having startup process issue this
if (InArchiveRecovery)
RequestCheckpoint(CHECKPOINT_IS_SHUTDOWN |
CHECKPOINT_FORCE |
CHECKPOINT_WAIT)
else
should make the startup process wait for bgwriter to complete the
checkpoint. But we need to set LocalRecoveryInProgress appropriately
also.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
On Thu, 2009-06-25 at 18:12 +0300, Heikki Linnakangas wrote:
A short fix would be to have bgwriter do the shutdown checkpoint instead
in archive recovery. I don't recall if there was a reason it wasn't
coded like that to begin with, though.I think the problem was that it was coded both ways at different stages
of patch evolution. The decision to retain the shutdown checkpoint by
the startup process was taken in January, IIRC.Having startup process issue this
if (InArchiveRecovery)
RequestCheckpoint(CHECKPOINT_IS_SHUTDOWN |
CHECKPOINT_FORCE |
CHECKPOINT_WAIT)
elseshould make the startup process wait for bgwriter to complete the
checkpoint. But we need to set LocalRecoveryInProgress appropriately
also.
Yeah, the trouble is to tell bgwriter that it's OK for it to create the
checkpoint, which includes writing a WAL record, while still keeping the
system "in-recovery" for all other purposes. While we could just relax
the checks, that seems like a very important safeguard.
(I posted in the other mail to do a restartpoint before the startup
process does the checkpoint)
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2009-06-25 at 18:57 +0300, Heikki Linnakangas wrote:
I came up with the attached patch, which includes Simon's patch to
have all fsync requests forwarded to bgwriter during archive recovery.
To fix the startup checkpoint issue, startup process requests a forced
restartpoint, which will flush any fsync requests bgwriter has
accumulated, before doing the actual checkpoint in the startup
process.
That looks fine.
This is completely untested still, but does anyone immediately see any
more problems?
Nothing seen.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2009-06-25 at 19:20 +0300, Heikki Linnakangas wrote:
Simon Riggs wrote:
But we need to set LocalRecoveryInProgress appropriately
also.
Yeah, the trouble is to tell bgwriter that it's OK for it to create the
checkpoint, which includes writing a WAL record, while still keeping the
system "in-recovery" for all other purposes. While we could just relax
the checks, that seems like a very important safeguard.(I posted in the other mail to do a restartpoint before the startup
process does the checkpoint)
I think we're in strong agreement, just thinking and writing emails in
realtime gives a few race conditions... your patch covers everything I
mentioned above.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
This is completely untested still, but does anyone immediately see any
more problems?
Isn't SetForwardFsyncRequests going to cause the final mdsync to fail?
regards, tom lane
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
This is completely untested still, but does anyone immediately see any
more problems?Isn't SetForwardFsyncRequests going to cause the final mdsync to fail?
Yes, I just noticed that myself, testing it for the first time. That
check needs to be suppressed in the startup checkpoint.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Tom Lane wrote:
Isn't SetForwardFsyncRequests going to cause the final mdsync to fail?
Yes, I just noticed that myself, testing it for the first time. That
check needs to be suppressed in the startup checkpoint.
Ugh. Better would be to move the responsibility for the final
checkpoint to the bgwriter.
However, this whole line of thought still seems to be leading towards
"fix the patch", and I don't have much confidence that we can have a
trustworthy fix today. What about "revert the patch"?
regards, tom lane
On Thu, 2009-06-25 at 12:33 -0400, Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
This is completely untested still, but does anyone immediately see any
more problems?Isn't SetForwardFsyncRequests going to cause the final mdsync to fail?
Yes it will, as it is now.
I didn't read the parts of the patch that were described as being from
my patch. Heikki, please tell me if you change things... it means I have
to be both patch author and reviewer. Perhaps just separate patches, so
we can review them independently.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support