WAL replay of truncate fails if the table was dropped

Started by Heikki Linnakangas over 18 years ago | 6 messages | bugs
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

mdtruncate throws an error if the relation file doesn't exist. However,
that's not an error condition during WAL replay if the relation was
dropped later. A non-existent file should be treated the same as an
already truncated file; as it stands, we end up with an unrecoverable
database.

This bug seems to be present from 8.0 onwards.

Attached is a test case to reproduce it, along with a patch for CVS
HEAD, and an adapted version of the patch for 8.0-8.2.

Thanks to my colleague Dharmendra Goyal for finding this bug and
constructing an initial test case.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

crash-2.sql (text/x-sql)
truncate-replay-fix.patch (text/x-diff, +12 -2)
truncate-replay-fix-80.patch (text/x-diff, +12 -2)
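
The failure mode in the report above can be reproduced in miniature outside PostgreSQL. What follows is a minimal Python sketch, not the attached patch: `truncate_strict` and `truncate_tolerant` are hypothetical stand-ins for the behaviour before and after the proposed fix, the relation is modelled as an ordinary file in a temporary directory, and the file name `16384` is made up.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # PostgreSQL's default block size


class ReplayError(Exception):
    """Stands in for an ERROR raised while replaying WAL."""


def truncate_strict(path, nblocks):
    """Hypothetical model of the reported behaviour: a missing file is an error."""
    if not os.path.exists(path):
        raise ReplayError(f"could not open relation file {path}")
    if os.path.getsize(path) > nblocks * BLOCK_SIZE:
        os.truncate(path, nblocks * BLOCK_SIZE)


def truncate_tolerant(path, nblocks, in_recovery):
    """Hypothetical model of the proposed fix: during recovery, a missing
    file is treated like a file that has already been truncated."""
    if not os.path.exists(path):
        if in_recovery:
            return  # nothing left to truncate
        raise ReplayError(f"could not open relation file {path}")
    if os.path.getsize(path) > nblocks * BLOCK_SIZE:
        os.truncate(path, nblocks * BLOCK_SIZE)


with tempfile.TemporaryDirectory() as tmp:
    rel = os.path.join(tmp, "16384")  # made-up relation file name
    with open(rel, "wb") as f:
        f.write(b"\0" * (3 * BLOCK_SIZE))
    os.remove(rel)  # the table was dropped before the crash

    # Replaying the older truncate record against the now-missing file:
    try:
        truncate_strict(rel, 1)
        strict_outcome = "ok"
    except ReplayError:
        strict_outcome = "failed"

    truncate_tolerant(rel, 1, in_recovery=True)
    tolerant_outcome = "ok"
```

In this toy model the strict variant fails on the missing file, while the tolerant variant returns quietly during recovery and still raises outside of it.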
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: WAL replay of truncate fails if the table was dropped

Heikki Linnakangas <heikki@enterprisedb.com> writes:

> mdtruncate throws an error if the relation file doesn't exist.

Interesting corner case. The proposed fix seems not very consistent
with the way we handle comparable cases elsewhere, though. In general,
md.c will cut some slack when InRecovery if a relation is shorter than
expected, but not if it's not there at all. (This is, indeed, what
justifies mdtruncate's response to file-too-short...) We handle
dropped files during recovery by forced smgrcreate() in places like
XLogOpenRelation. I'm inclined to think smgr_redo should force
smgrcreate() before trying to truncate.

regards, tom lane
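
The alternative suggested in this message, forcing a create before the truncate, can be sketched in the same toy setting. This is an illustration only, not PostgreSQL code: `redo_truncate` and `redo_drop` are hypothetical names, the relation is again an ordinary file, and opening in append mode merely stands in for a forced smgrcreate().

```python
import os
import tempfile

BLOCK_SIZE = 8192  # PostgreSQL's default block size


def redo_truncate(path, nblocks):
    """Hypothetical redo routine: make sure the file exists, then truncate."""
    # Append mode creates the file if it is missing and leaves any
    # existing contents untouched.
    with open(path, "ab"):
        pass
    if os.path.getsize(path) > nblocks * BLOCK_SIZE:
        os.truncate(path, nblocks * BLOCK_SIZE)


def redo_drop(path):
    """Hypothetical redo routine for the drop record that comes later in WAL."""
    if os.path.exists(path):
        os.remove(path)


with tempfile.TemporaryDirectory() as tmp:
    rel = os.path.join(tmp, "16384")  # made-up relation file name; file is absent

    redo_truncate(rel, 1)             # re-creates an empty file, then no-op truncate
    exists_after_truncate = os.path.exists(rel)
    size_after_truncate = os.path.getsize(rel)

    redo_drop(rel)                    # the later drop record removes it again
    exists_after_drop = os.path.exists(rel)
```

The re-created file is only transient in this sketch: it exists between the replay of the truncate and the replay of the drop, and is gone once replay has passed the drop record.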

#3 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#2)
Re: WAL replay of truncate fails if the table was dropped

Tom Lane wrote:

> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> mdtruncate throws an error if the relation file doesn't exist.
>
> Interesting corner case. The proposed fix seems not very consistent
> with the way we handle comparable cases elsewhere, though. In general,
> md.c will cut some slack when InRecovery if a relation is shorter than
> expected, but not if it's not there at all. (This is, indeed, what
> justifies mdtruncate's response to file-too-short...) We handle
> dropped files during recovery by forced smgrcreate() in places like
> XLogOpenRelation. I'm inclined to think smgr_redo should force
> smgrcreate() before trying to truncate.

I followed the example of the file-too-short case. Yeah, calling
smgrcreate would work and I can see the justification for that as well.

Interestingly, this bug isn't triggered unless there's an already empty
or uninitialized page at the end of the table. If vacuum removes the last
tuple from the page, that will be WAL-logged, and replay of that record
calls smgrcreate.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#3)
Re: WAL replay of truncate fails if the table was dropped

Heikki Linnakangas <heikki@enterprisedb.com> writes:

> Interestingly, this bug isn't triggered unless there's an already empty
> or uninitialized page at the end of the table. If vacuum removes the last
> tuple from the page, that will be WAL-logged and replay of that calls
> smgrcreate.

Yeah, I tried other ways to provoke the failure and came to the same
conclusion. The reproducer really is relying on the fact that vacuum's
PageInit of an uninitialized page doesn't get WAL-logged. Which is a
bit nervous-making. As far as I can think at the moment, it won't
provoke any problem because the first subsequent WAL-logged touch of
the page would be an INSERT with the INIT bit set; but it does mean
that a warm-standby slave would be out of sync with the master for an
indefinitely long period with respect to the on-disk contents of such a
page. Does that matter?

Note that we have to fix truncate replay anyway, since you could have
the same failure if a checkpoint happened just before an ordinary
vacuum's truncate. This PageInit behavior merely allows a simpler
reproducer script with no race condition involved.

regards, tom lane
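
The dependence on where replay starts can be illustrated with a toy WAL stream. The sketch below is a simplification under stated assumptions: the record names (`heap_clean`, `truncate`, `drop`), the LSN values, and the `replay` function are all made up, and the only behaviour taken from the thread is that replaying a page-level record re-creates a missing file while replaying a truncate does not.

```python
class ReplayError(Exception):
    """Stands in for an ERROR raised while replaying WAL."""


# (lsn, record kind, relation) in WAL order; all values are illustrative.
WAL = [
    (10, "heap_clean", "t"),  # vacuum removes the last tuples from a page
    (20, "truncate", "t"),    # vacuum truncates the now-empty tail
    (30, "drop", "t"),        # the table is dropped afterwards
]


def replay(wal, redo_start, disk):
    """Replay records with lsn >= redo_start against `disk`, the set of
    relation files that exist when recovery begins."""
    disk = set(disk)
    for lsn, kind, rel in wal:
        if lsn < redo_start:
            continue  # already covered by the checkpoint
        if kind == "heap_clean":
            disk.add(rel)  # page-level redo re-creates a missing file
        elif kind == "truncate":
            if rel not in disk:
                raise ReplayError(f"LSN {lsn}: relation {rel} does not exist")
        elif kind == "drop":
            disk.discard(rel)
    return disk


def try_replay(redo_start):
    """The crash happened after the drop, so the file is gone on disk."""
    try:
        replay(WAL, redo_start, disk=set())
        return "recovered"
    except ReplayError:
        return "failed"


checkpoint_before_clean = try_replay(10)     # heap_clean is replayed first
checkpoint_before_truncate = try_replay(20)  # truncate is the first record seen
```

With the checkpoint before the page-level record, replay re-creates the file before it reaches the truncate and recovery completes; with the checkpoint just before the truncate, the truncate is the first record to touch the relation and replay fails.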

#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#4)
Re: WAL replay of truncate fails if the table was dropped

On Fri, 2007-07-20 at 11:38 -0400, Tom Lane wrote:

> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Interestingly, this bug isn't triggered unless there's an already empty
>> or uninitialized page at the end of the table. If vacuum removes the last
>> tuple from the page, that will be WAL-logged and replay of that calls
>> smgrcreate.
>
> Yeah, I tried other ways to provoke the failure and came to the same
> conclusion. The reproducer really is relying on the fact that vacuum's
> PageInit of an uninitialized page doesn't get WAL-logged. Which is a
> bit nervous-making. As far as I can think at the moment, it won't
> provoke any problem because the first subsequent WAL-logged touch of
> the page would be an INSERT with the INIT bit set; but it does mean
> that a warm-standby slave would be out of sync with the master for an
> indefinitely long period with respect to the on-disk contents of such a
> page. Does that matter?

If I understand this: the primary would be initialised yet the standby
would remain uninitialised? I don't think that matters because the
actual data contents are still zero.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#5)
Re: WAL replay of truncate fails if the table was dropped

"Simon Riggs" <simon@2ndquadrant.com> writes:

> If I understand this: the primary would be initialised yet the standby
> would remain uninitialised? I don't think that matters because the
> actual data contents are still zero.

From a logical perspective the page is "empty" either way. The only
behavioral difference I can think of is that the initialized page is a
candidate for insertion of new tuples, whereas on the slave it would not
be a candidate until after another VACUUM. So the histories would
diverge faster once the slave comes alive. As long as the slave is just
following WAL records and not making any decisions of its own, I can't
see a failure mode; but it looks like a potential weak spot for future
extensions (particularly, trying to allow slave servers to execute
queries).

regards, tom lane
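
The divergence described in this message can be made concrete with a very small model. This is only a sketch under loud assumptions: `choose_insert_page` is a hypothetical placement rule, page states are plain strings, and the idea that a server with no candidate page extends the relation is an assumption of the sketch rather than something stated in the thread.

```python
def choose_insert_page(pages):
    """Hypothetical placement rule: reuse the first initialized empty page,
    otherwise extend the relation by one page."""
    for blkno, state in enumerate(pages):
        if state == "initialized-empty":
            return blkno
    return len(pages)  # no candidate: the new tuple goes to a new page


# Same logical contents, different on-disk state of the last page.
# Vacuum's PageInit on the primary was not WAL-logged, so the standby
# never saw it.
primary = ["full", "full", "initialized-empty"]
standby = ["full", "full", "uninitialized"]

primary_target = choose_insert_page(primary)
standby_target = choose_insert_page(standby)
```

In this model the primary places the next tuple in block 2, while a standby that came alive at the same point would place it in a new block 3, which is the kind of diverging history the message refers to.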