RelationCreateStorage can orphan files

Started by Robert Haas, over 15 years ago, 7 messages, hackers
#1 Robert Haas
robertmhaas@gmail.com

I notice that RelationCreateStorage() creates the main fork on disk
before writing (let alone flushing) WAL. So if PG gets killed at that
point, we end up with an orphaned file on disk. I think that we could
even extend the relation a few times before WAL gets written, so I
don't even think it's necessarily a zero-size file. We could perhaps
avoid this by writing and flushing a WAL record that includes the
creating XID before touching the disk; when we replay the record, we
create the file but then delete it if the XID fails to commit before
recovery ends. But I guess maybe our feeling is that it's just not
worth taking a performance hit for this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#1)
Re: RelationCreateStorage can orphan files

Robert Haas <robertmhaas@gmail.com> writes:

I notice that RelationCreateStorage() creates the main fork on disk
before writing (let alone flushing) WAL. So if PG gets killed at that
point, we end up with an orphaned file on disk. I think that we could
even extend the relation a few times before WAL gets written, so I
don't even think it's necessarily a zero-size file. We could perhaps
avoid this by writing and flushing a WAL record that includes the
creating XID before touching the disk; when we replay the record, we
create the file but then delete it if the XID fails to commit before
recovery ends. But I guess maybe our feeling is that it's just not
worth taking a performance hit for this?

That design is intentional. If the file create fails, and you've
already written a WAL record that says you created it, you are flat
out screwed. You can't even PANIC --- if you do, then the replay of
the WAL record will likely fail and PANIC again, leaving the database
dead in the water.

Orphaned files, in contrast, are completely non-dangerous --- the worst
they can do is waste a little bit of disk space. That's a cheap price
to pay for not having an unrecoverable database after a create failure.

This is essentially the same reason why CREATE DATABASE and related
commands xlog directory copy operations only after completing them.
That potentially wastes much more than a few blocks; but it's still
non-dangerous, and far safer than the alternative.

regards, tom lane
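
The ordering argument above can be illustrated with a toy model. This is only a sketch under loud assumptions: `wal` is a plain Python list standing in for the write-ahead log, and `create_file`, `create_then_log`, `log_then_create`, `replay`, `recovers`, and `demo` are hypothetical helper names, not PostgreSQL functions. It compares the two orderings when the file create fails, and shows the orphan left by a crash between create and log.

```python
import os
import tempfile

# Toy model only: `wal` is a list standing in for the write-ahead log.
# All helper names here are hypothetical, not PostgreSQL APIs.

def create_file(path, fail=False):
    """Create an empty file, or raise OSError to simulate a failed create."""
    if fail:
        raise OSError("simulated create failure")
    with open(path, "x"):
        pass

def create_then_log(wal, path, fail=False):
    """Ordering described in the thread: touch the disk first, log second."""
    create_file(path, fail)        # a failure here leaves the WAL untouched
    wal.append(("create", path))   # a crash before this line orphans the file

def log_then_create(wal, path, fail=False):
    """Alternative ordering: log first, touch the disk second."""
    wal.append(("create", path))   # the WAL now claims the file exists
    create_file(path, fail)        # a failure here contradicts the WAL

def replay(wal, fail=False):
    """Redo every logged create; any failure aborts recovery."""
    for op, path in wal:
        if op == "create" and not os.path.exists(path):
            create_file(path, fail)

def recovers(wal, fail):
    """Return True if replay completes, False if it hits the same failure."""
    try:
        replay(wal, fail)
        return True
    except OSError:
        return False

def demo():
    out = {}
    with tempfile.TemporaryDirectory() as d:
        # Create-first, create fails: nothing was logged, nothing to redo.
        wal = []
        try:
            create_then_log(wal, os.path.join(d, "a"), fail=True)
        except OSError:
            pass
        out["create_first_wal"] = list(wal)
        out["create_first_recovers"] = recovers(wal, fail=True)

        # Log-first, create fails: the record survives and replay fails again.
        wal = []
        try:
            log_then_create(wal, os.path.join(d, "b"), fail=True)
        except OSError:
            pass
        out["log_first_recovers"] = recovers(wal, fail=True)

        # Create-first, crash between the two steps: an unlogged file remains.
        wal = []
        orphan = os.path.join(d, "c")
        create_file(orphan)        # "crash" here, before the WAL append
        out["orphan_left_behind"] = os.path.exists(orphan) and wal == []
    return out
```

In this model the create-first ordering can leave a stray file but never a log record that recovery cannot redo, which is the trade-off the message describes.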

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#2)
Re: RelationCreateStorage can orphan files

On Wed, Sep 15, 2010 at 9:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I notice that RelationCreateStorage() creates the main fork on disk
before writing (let alone flushing) WAL.  So if PG gets killed at that
point, we end up with an orphaned file on disk.  I think that we could
even extend the relation a few times before WAL gets written, so I
don't even think it's necessarily a zero-size file.  We could perhaps
avoid this by writing and flushing a WAL record that includes the
creating XID before touching the disk; when we replay the record, we
create the file but then delete it if the XID fails to commit before
recovery ends.  But I guess maybe our feeling is that it's just not
worth taking a performance hit for this?

That design is intentional.  If the file create fails, and you've
already written a WAL record that says you created it, you are flat
out screwed.  You can't even PANIC --- if you do, then the replay of
the WAL record will likely fail and PANIC again, leaving the database
dead in the water.

Not that this is perhaps more than of academic interest, but could you
get around this problem by making the replay of the XLOG record defer
the creation of the file until such time as it's actually written to
or the creating XID commits? And also, if the XID does not commit,
going back and trying to remove the file (on a best effort basis)?

Orphaned files, in contrast, are completely non-dangerous --- the worst
they can do is waste a little bit of disk space.  That's a cheap price
to pay for not having an unrecoverable database after a create failure.

This is essentially the same reason why CREATE DATABASE and related
commands xlog directory copy operations only after completing them.
That potentially wastes much more than a few blocks; but it's still
non-dangerous, and far safer than the alternative.

Thanks for the explanation.

--
Robert Haas

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#3)
Re: RelationCreateStorage can orphan files

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Sep 15, 2010 at 9:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

That design is intentional. If the file create fails, and you've
already written a WAL record that says you created it, you are flat
out screwed. You can't even PANIC --- if you do, then the replay of
the WAL record will likely fail and PANIC again, leaving the database
dead in the water.

Not that this is perhaps more than of academic interest, but could you
get around this problem by making the replay of the XLOG record defer
the creation of the file until such time as it's actually written to
or the creating XID commits? And also, if the XID does not commit,
going back and trying to remove the file (on a best effort basis)?

Perhaps, but it seems like a lot more complexity than is justified
by the problem.

regards, tom lane
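
For readers who want to see where the extra complexity comes from, the deferral idea quoted above might look roughly like the following. This is a hedged sketch, not a proposed patch: the record shapes (`("create", xid, path)`, `("write", path, data)`, `("commit", xid)`) and the names `replay_deferred` and `demo` are invented for illustration, and a real implementation would have to handle many cases this toy ignores.

```python
import os
import tempfile

# Toy model of the deferred-creation idea; record shapes and names are
# hypothetical and do not correspond to real PostgreSQL WAL records.

def replay_deferred(wal):
    """Replay `wal`, creating files lazily and cleaning up uncommitted ones."""
    pending = {}       # path -> creating xid; the file may not exist yet
    committed = set()  # xids whose commit record was seen during replay

    def materialize(path):
        # Create the file only at the moment it is actually needed.
        if not os.path.exists(path):
            with open(path, "x"):
                pass

    for rec in wal:
        kind = rec[0]
        if kind == "create":       # remember the create, do nothing on disk yet
            _, xid, path = rec
            pending[path] = xid
        elif kind == "write":      # the first write forces creation
            _, path, data = rec
            materialize(path)
            with open(path, "ab") as f:
                f.write(data)
        elif kind == "commit":     # commit makes the creation permanent
            _, xid = rec
            committed.add(xid)
            for path, owner in pending.items():
                if owner == xid:
                    materialize(path)

    # End of recovery: best-effort removal for creators that never committed.
    removed = []
    for path, xid in pending.items():
        if xid not in committed:
            try:
                os.remove(path)
                removed.append(path)
            except OSError:
                pass               # best effort only; a leftover file is harmless
    return removed

def demo():
    with tempfile.TemporaryDirectory() as d:
        kept, dropped, never = (os.path.join(d, n)
                                for n in ("kept", "dropped", "never"))
        wal = [
            ("create", 1, kept),
            ("create", 2, dropped),
            ("create", 3, never),
            ("write", dropped, b"x"),
            ("commit", 1),
        ]
        removed = replay_deferred(wal)
        return (os.path.exists(kept), os.path.exists(dropped),
                os.path.exists(never), removed == [dropped])
```

Even this toy needs a pending-create table, a commit set, and an end-of-recovery cleanup pass, which gives some feel for the bookkeeping the reply considers out of proportion to the problem.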

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#4)
Re: RelationCreateStorage can orphan files

On Wed, Sep 15, 2010 at 10:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Sep 15, 2010 at 9:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

That design is intentional.  If the file create fails, and you've
already written a WAL record that says you created it, you are flat
out screwed.  You can't even PANIC --- if you do, then the replay of
the WAL record will likely fail and PANIC again, leaving the database
dead in the water.

Not that this is perhaps more than of academic interest, but could you
get around this problem by making the replay of the XLOG record defer
the creation of the file until such time as it's actually written to
or the creating XID commits?  And also, if the XID does not commit,
going back and trying to remove the file (on a best effort basis)?

Perhaps, but it seems like a lot more complexity than is justified
by the problem.

That's sort of what I figured.

--
Robert Haas

#6 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#2)
Re: RelationCreateStorage can orphan files

Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I notice that RelationCreateStorage() creates the main fork on disk
before writing (let alone flushing) WAL. So if PG gets killed at that
point, we end up with an orphaned file on disk. I think that we could
even extend the relation a few times before WAL gets written, so I
don't even think it's necessarily a zero-size file. We could perhaps
avoid this by writing and flushing a WAL record that includes the
creating XID before touching the disk; when we replay the record, we
create the file but then delete it if the XID fails to commit before
recovery ends. But I guess maybe our feeling is that it's just not
worth taking a performance hit for this?

That design is intentional. If the file create fails, and you've
already written a WAL record that says you created it, you are flat
out screwed. You can't even PANIC --- if you do, then the replay of
the WAL record will likely fail and PANIC again, leaving the database
dead in the water.

Orphaned files, in contrast, are completely non-dangerous --- the worst
they can do is waste a little bit of disk space. That's a cheap price
to pay for not having an unrecoverable database after a create failure.

This is essentially the same reason why CREATE DATABASE and related
commands xlog directory copy operations only after completing them.
That potentially wastes much more than a few blocks; but it's still
non-dangerous, and far safer than the alternative.

Is this documented in a C comment somewhere? Obviously not in a place
Robert found.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#6)
Re: RelationCreateStorage can orphan files

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

This is essentially the same reason why CREATE DATABASE and related
commands xlog directory copy operations only after completing them.
That potentially wastes much more than a few blocks; but it's still
non-dangerous, and far safer than the alternative.

Is this documented in a C comment somewhere? Obviously not in a place
Robert found.

I had thought it was documented in the discussion of WAL logging rules
in access/transam/README, but it isn't. I'll see about adding
something.

regards, tom lane