fsync-pgdata-on-recovery tries to write to more files than previously
Hi,
the new fsync-pgdata-on-recovery code tries to open all files using
O_RDWR. At least on 9.1, this can make recovery fail:
* launch postgres, hit ^\ (or otherwise shut down uncleanly)
* touch foo; chmod 444 foo
* launch postgres
LOG: database system was interrupted; last known up at 2015-05-23 19:18:36 CEST
FATAL: could not open file "/home/cbe/9.1/foo": Permission denied
LOG: startup process (PID 27305) exited with exit code 1
LOG: aborting startup due to startup process failure
The code on 9.4 looks similar to me, but I couldn't trigger the
problem there.
I think this is a real-world problem:
1) In older releases, the SSL certs were read from the data directory,
and at least the default Debian installation creates symlinks from
PGDATA/server.* to /etc/ssl/ where PostgreSQL can't write
2) It's probably a pretty common scenario that the root user will edit
postgresql.conf, and make backups or create other random files there
that are not writable. Even a non-writable postgresql.conf itself or
recovery.conf was not a problem previously.
To me, this is a serious regression because it prevents automatic
startup of a server that would otherwise just run.
Christoph
--
cb@df7cb.de | http://www.df7cb.de/
Re: To PostgreSQL Hackers 2015-05-23 <20150523172627.GA24277@msg.df7cb.de>
Hi,
the new fsync-pgdata-on-recovery code tries to open all files using
O_RDWR. At least on 9.1, this can make recovery fail:
* launch postgres, hit ^\ (or otherwise shut down uncleanly)
* touch foo; chmod 444 foo
* launch postgres
LOG: database system was interrupted; last known up at 2015-05-23 19:18:36 CEST
FATAL: could not open file "/home/cbe/9.1/foo": Permission denied
LOG: startup process (PID 27305) exited with exit code 1
LOG: aborting startup due to startup process failure
The code on 9.4 looks similar to me, but I couldn't trigger the
problem there.
Correction: 9.4 is equally broken. (I was still running 9.4.1 when I
tried first.)
I think this is a real-world problem:
1) In older releases, the SSL certs were read from the data directory,
and at least the default Debian installation creates symlinks from
PGDATA/server.* to /etc/ssl/ where PostgreSQL can't write
2) It's probably a pretty common scenario that the root user will edit
postgresql.conf, and make backups or create other random files there
that are not writable. Even a non-writable postgresql.conf itself or
recovery.conf was not a problem previously.
3) The .postgresql.conf.swp files created by (root's) vim are 0600.
To me, this is a serious regression because it prevents automatic
startup of a server that would otherwise just run.
Christoph
Christoph Berg <myon@debian.org> writes:
the new fsync-pgdata-on-recovery code tries to open all files using
O_RDWR. At least on 9.1, this can make recovery fail:
Hm. I wonder whether it would be all right to just skip files for which
we get EPERM on open(). The argument being that if we can't write to the
file, we should not be held responsible for fsync'ing it either. But
I'm not sure whether EPERM would be the only relevant errno, or whether
there are cases where this would mask real problems.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
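A minimal sketch of the "skip files we cannot open" idea from Tom's message above, written in Python purely for illustration (the backend code is C, and this is not it). The names fsync_tree, SKIPPABLE and the open_fn parameter are hypothetical. One detail relevant to the question of which errno to check: the "Permission denied" text in the reported log is the message for EACCES, while EPERM is a separate code ("Operation not permitted"), so the sketch tolerates both.

```python
import errno
import os

# Errnos treated as "not ours to fsync" (an assumption of this sketch).
# "Permission denied" in the log corresponds to EACCES; EPERM is distinct.
SKIPPABLE = {errno.EACCES, errno.EPERM}


def fsync_tree(root, open_fn=os.open):
    """Fsync every file under root; return the paths skipped for lack of permission."""
    skipped = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                fd = open_fn(path, os.O_RDWR)
            except OSError as exc:
                if exc.errno in SKIPPABLE:
                    # Unwritable file: skip it instead of failing recovery.
                    skipped.append(path)
                    continue
                raise  # any other error is still treated as a real problem
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
    return skipped
```

Returning the skipped paths leaves room for logging them, which is one way to address the worry about masking real problems.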
Re: Tom Lane 2015-05-23 <2284.1432413209@sss.pgh.pa.us>
Christoph Berg <myon@debian.org> writes:
the new fsync-pgdata-on-recovery code tries to open all files using
O_RDWR. At least on 9.1, this can make recovery fail:
Hm. I wonder whether it would be all right to just skip files for which
we get EPERM on open(). The argument being that if we can't write to the
file, we should not be held responsible for fsync'ing it either. But
I'm not sure whether EPERM would be the only relevant errno, or whether
there are cases where this would mask real problems.
Maybe logging WARNINGs instead of FATAL would be enough of a fix?
Christoph
On 2015-05-23 16:33:29 -0400, Tom Lane wrote:
Christoph Berg <myon@debian.org> writes:
the new fsync-pgdata-on-recovery code tries to open all files using
O_RDWR. At least on 9.1, this can make recovery fail:
Hm. I wonder whether it would be all right to just skip files for which
we get EPERM on open(). The argument being that if we can't write to the
file, we should not be held responsible for fsync'ing it either. But
I'm not sure whether EPERM would be the only relevant errno, or whether
there are cases where this would mask real problems.
We could even try doing an fsync with a readonly fd as a fallback,
but that's also pretty hacky.
How about, to avoid masking actual problems, we have a more
differentiated logic for the toplevel data directory? I think we could
just skip all non-directory files in the data_directory itself. None
of the files in the toplevel directory, with the exception of
postgresql.auto.conf, will ever get written to by PG itself. And if
there's readonly files somewhere in a subdirectory, I won't feel
particularly bad.
Greetings,
Andres Freund
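A rough sketch of the top-level rule proposed above: plain files directly in the data directory are left alone, and only the contents of subdirectories are collected for fsync. It is written in Python for illustration only, the function name files_to_fsync is hypothetical, and it deliberately ignores the postgresql.auto.conf exception mentioned in the message.

```python
import os


def files_to_fsync(data_dir):
    """Yield files found inside subdirectories of data_dir, skipping top-level plain files."""
    for entry in sorted(os.listdir(data_dir)):
        top = os.path.join(data_dir, entry)
        if not os.path.isdir(top):
            # Top-level non-directory entry (e.g. a root-owned config backup
            # or a symlinked certificate): not touched under this rule.
            continue
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                yield os.path.join(dirpath, name)
```

With this rule, a read-only file such as the foo from the original report never gets opened, while a read-only file below a subdirectory would still be opened and could still fail.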
Re: Andres Freund 2015-05-24 <20150524005245.GD32396@alap3.anarazel.de>
How about, to avoid masking actual problems, we have a more
differentiated logic for the toplevel data directory? I think we could
just skip all non-directory files in the data_directory itself. None
of the files in the toplevel directory, with the exception of
postgresql.auto.conf, will ever get written to by PG itself. And if
there's readonly files somewhere in a subdirectory, I won't feel
particularly bad.
I like that idea.
Christoph
Re: To Andres Freund 2015-05-24 <20150524075244.GB27048@msg.df7cb.de>
Re: Andres Freund 2015-05-24 <20150524005245.GD32396@alap3.anarazel.de>
How about, to avoid masking actual problems, we have a more
differentiated logic for the toplevel data directory? I think we could
just skip all non-directory files in the data_directory itself. None
of the files in the toplevel directory, with the exception of
postgresql.auto.conf, will ever get written to by PG itself. And if
there's readonly files somewhere in a subdirectory, I won't feel
particularly bad.
pg_log/ is also admin domain. What about only recursing into
well-known directories + postgresql.auto.conf?
(I've also been wondering if pg_basebackup shouldn't skip pg_log, but
that's a different topic...)
Christoph
Christoph Berg <myon@debian.org> writes:
Re: To Andres Freund 2015-05-24 <20150524075244.GB27048@msg.df7cb.de>
Re: Andres Freund 2015-05-24 <20150524005245.GD32396@alap3.anarazel.de>
How about, to avoid masking actual problems, we have a more
differentiated logic for the toplevel data directory?
pg_log/ is also admin domain. What about only recursing into
well-known directories + postgresql.auto.conf?
The idea that this code would know exactly what's what under $PGDATA
scares me. I can positively guarantee that it would diverge from reality
over time, and nobody would notice until it ate their data, failed to
start, or otherwise behaved undesirably.
pg_log/ is a perfect example, because that is not a hard-wired directory
name; somebody could point the syslogger at a different place very easily.
Wiring in special behavior for that name is just wrong.
I would *much* rather have a uniform rule for how to treat each file
the scan comes across. It might take some tweaking to get to one that
works well; but once we did, we could have some confidence that it
wouldn't break later.
regards, tom lane
On May 24, 2015 7:52:53 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Christoph Berg <myon@debian.org> writes:
Re: To Andres Freund 2015-05-24 <20150524075244.GB27048@msg.df7cb.de>
Re: Andres Freund 2015-05-24 <20150524005245.GD32396@alap3.anarazel.de>
How about, to avoid masking actual problems, we have a more
differentiated logic for the toplevel data directory?
pg_log/ is also admin domain. What about only recursing into
well-known directories + postgresql.auto.conf?
The idea that this code would know exactly what's what under $PGDATA
scares me. I can positively guarantee that it would diverge from reality
over time, and nobody would notice until it ate their data, failed to
start, or otherwise behaved undesirably.
pg_log/ is a perfect example, because that is not a hard-wired directory
name; somebody could point the syslogger at a different place very easily.
Wiring in special behavior for that name is just wrong.
I would *much* rather have a uniform rule for how to treat each file
the scan comes across. It might take some tweaking to get to one that
works well; but once we did, we could have some confidence that it
wouldn't break later.
If we'd merge it with initdb's list I think it'd not be that bad. I'm
thinking of some header declaring it, roughly like the rmgr list.
Andres
Andres Freund <andres@anarazel.de> writes:
On May 24, 2015 7:52:53 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Christoph Berg <myon@debian.org> writes:
pg_log/ is also admin domain. What about only recursing into
well-known directories + postgresql.auto.conf?
The idea that this code would know exactly what's what under $PGDATA
scares me. I can positively guarantee that it would diverge from
reality over time, and nobody would notice until it ate their data,
failed to start, or otherwise behaved undesirably.
pg_log/ is a perfect example, because that is not a hard-wired
directory name; somebody could point the syslogger at a different place
very easily. Wiring in special behavior for that name is just wrong.
I would *much* rather have a uniform rule for how to treat each file
the scan comes across. It might take some tweaking to get to one that
works well; but once we did, we could have some confidence that it
wouldn't break later.
If we'd merge it with initdb's list I think it'd not be that bad. I'm
thinking of some header declaring it, roughly like the rmgr list.
pg_log/ is a counterexample to that idea too; initdb doesn't know about it
(and shouldn't).
regards, tom lane
On 2015-05-25 13:38:01 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On May 24, 2015 7:52:53 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
If we'd merge it with initdb's list I think it'd not be that bad. I'm
thinking of some header declaring it, roughly like the rmgr list.
pg_log/ is a counterexample to that idea too; initdb doesn't know about it
(and shouldn't).
The idea would be to *only* fsync directories that initdb knows about. Since
that's where the valuables are. So I don't see how pg_log would be a
counterexample.
* Andres Freund (andres@anarazel.de) wrote:
On 2015-05-25 13:38:01 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On May 24, 2015 7:52:53 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
If we'd merge it with initdb's list I think it'd not be that bad. I'm
thinking of some header declaring it, roughly like the rmgr list.
pg_log/ is a counterexample to that idea too; initdb doesn't know about it
(and shouldn't).
The idea would be to *only* fsync directories that initdb knows about. Since
that's where the valuables are. So I don't see how pg_log would be a
counterexample.
Indeed, that wouldn't be included in the list of things to fsync and it
isn't listed in initdb, so that works.
I've not followed this thread all that closely, but I do tend to agree
with the idea of "only try to mess with files that are *clearly* ours to
mess with."
Thanks!
Stephen
Stephen Frost <sfrost@snowman.net> writes:
I've not followed this thread all that closely, but I do tend to agree
with the idea of "only try to mess with files that are *clearly* ours to
mess with."
Well, that opens us to errors of omission, ie failing to fsync things we
should have. Maybe that's an okay risk, but personally I'd judge that
"fsync everything and ignore (some?) errors" is probably a more robust
approach over time.
regards, tom lane
On 2015-05-25 14:02:28 -0400, Tom Lane wrote:
Stephen Frost <sfrost@snowman.net> writes:
I've not followed this thread all that closely, but I do tend to agree
with the idea of "only try to mess with files that are *clearly* ours to
mess with."
Well, that opens us to errors of omission, ie failing to fsync things we
should have.
Is that really that likely? I mean we don't normally add data to the top
level directory itself, and subdirectories hopefully won't be added
except via initdb?
Maybe that's an okay risk, but personally I'd judge that
"fsync everything and ignore (some?) errors" is probably a more robust
approach over time.
The over-the-top approach would be to combine the two. Error out in
directories that are in the initdb list, and ignore permission errors
otherwise...
Additionally we could attempt to fsync with a readonly fd before trying
the read-write fd...
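The combined approach described above could be expressed as a small policy function: permission errors are fatal inside directories from the initdb list and ignored elsewhere. This is only a sketch in Python under stated assumptions; classify_open_error is a hypothetical name, and KNOWN_DIRS is an illustrative subset of directory names, not a copy of initdb's actual list.

```python
import errno

# Illustrative subset only; the real list would come from initdb.
KNOWN_DIRS = ("global", "base", "pg_xlog", "pg_clog", "pg_tblspc")


def classify_open_error(relpath, err, known_dirs=KNOWN_DIRS):
    """Return "skip" or "fatal" for an open() failure on relpath (relative to the data directory)."""
    top = relpath.split("/", 1)[0]
    # A path counts as "ours" only if it lies below a known directory.
    in_known_dir = "/" in relpath and top in known_dirs
    if err in (errno.EACCES, errno.EPERM) and not in_known_dir:
        return "skip"
    # Permission errors in known directories, and all other errors, stay fatal.
    return "fatal"
```

For example, classify_open_error("foo", errno.EACCES) and classify_open_error("pg_log/x.log", errno.EACCES) both return "skip", while the same error on "base/1/1234" returns "fatal".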
* Andres Freund (andres@anarazel.de) wrote:
On 2015-05-25 14:02:28 -0400, Tom Lane wrote:
Stephen Frost <sfrost@snowman.net> writes:
I've not followed this thread all that closely, but I do tend to agree
with the idea of "only try to mess with files that are *clearly* ours to
mess with."
Well, that opens us to errors of omission, ie failing to fsync things we
should have.
Is that really that likely? I mean we don't normally add data to the top
level directory itself, and subdirectories hopefully won't be added
except via initdb?
That feels like a pretty low risk, to me at least. Certainly better
than having a failure, like what's going on now.
Maybe that's an okay risk, but personally I'd judge that
"fsync everything and ignore (some?) errors" is probably a more robust
approach over time.
The over-the-top approach would be to combine the two. Error out in
directories that are in the initdb list, and ignore permission errors
otherwise...
That seems overly complicated, for my 2c at least. I don't particularly
like trying to mess with files that might be rightfully considered "not
ours" either. This all makes me really wonder about
postgresql.auto.conf too.. Clearly, on the one hand, we consider that
"our" file, and so we should error out if we don't own it, but on the
other hand, I've specifically recommended making that file owned by
root to some folks, to avoid DBAs playing with the startup-time
settings..
Additionally we could attempt to fsync with a readonly fd before trying
the read-write fd...
Not really sure I see that as helping.
Thanks!
Stephen
On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
That seems overly complicated, for my 2c at least. I don't particularly
like trying to mess with files that might be rightfully considered "not
ours" either.
I'd not consider an fsync to be "messing" with files, especially if
they're in PGDATA.
Additionally we could attempt to fsync with a readonly fd before trying
the read-write fd...
Not really sure I see that as helping.
On most OSs, except windows and some obscure unixes, a readonly fd is
allowed to fsync a file.
Greetings,
Andres Freund
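A sketch of the read-only descriptor idea, assuming a platform where fsync() on a read-only fd is permitted, as the message above says is the case on most OSs. It is Python for illustration only, and fsync_path is a hypothetical name. This variant tries read-write first and falls back to read-only; it still fails if the file cannot be opened for reading either.

```python
import os


def fsync_path(path):
    """Fsync path via a read-write fd if possible, else via a read-only fd.

    Returns "rw" or "ro" to show which kind of descriptor was used.
    """
    try:
        fd = os.open(path, os.O_RDWR)
        mode = "rw"
    except PermissionError:
        # Covers EACCES and EPERM; retry without asking for write access.
        fd = os.open(path, os.O_RDONLY)
        mode = "ro"
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    return mode
```

Note that a process running as root can usually open a mode-0444 file read-write, so the fallback path is only exercised for unprivileged users.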
Andres Freund <andres@anarazel.de> writes:
On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
Additionally we could attempt to fsync with a readonly fd before trying
the read-write fd...
Not really sure I see that as helping.
On most OSs, except windows and some obscure unixes, a readonly fd is
allowed to fsync a file.
Perhaps, but if we didn't have permission to write the file, it's hard to
argue that it's our responsibility to fsync it. So this seems like it's
adding complexity without really adding any safety.
regards, tom lane
* Andres Freund (andres@anarazel.de) wrote:
On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
That seems overly complicated, for my 2c at least. I don't particularly
like trying to mess with files that might be rightfully considered "not
ours" either.
I'd not consider an fsync to be "messing" with files, especially if
they're in PGDATA.
I'm not entirely sure I agree.
Additionally we could attempt to fsync with a readonly fd before trying
the read-write fd...
Not really sure I see that as helping.
On most OSs, except windows and some obscure unixes, a readonly fd is
allowed to fsync a file.
I wouldn't have thought otherwise, given that you were suggesting it,
but there's no guarantee we're going to be allowed to read it either, or
even access the directory the symlink points to, etc..
Thanks,
Stephen
Tom Lane wrote:
Stephen Frost <sfrost@snowman.net> writes:
I've not followed this thread all that closely, but I do tend to agree
with the idea of "only try to mess with files that are *clearly* ours to
mess with."
Well, that opens us to errors of omission, ie failing to fsync things we
should have. Maybe that's an okay risk, but personally I'd judge that
"fsync everything and ignore (some?) errors" is probably a more robust
approach over time.
How is it possible to make errors of omission? The list of directories
in initdb is the complete set of directories that are created for a
newly-initdb'd database, no? Surely there can't be a database that
contains vital directories that are not created there? See "subdirs"
static in initdb.c.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
Well, that opens us to errors of omission, ie failing to fsync things we
should have. Maybe that's an okay risk, but personally I'd judge that
"fsync everything and ignore (some?) errors" is probably a more robust
approach over time.
How is it possible to make errors of omission? The list of directories
in initdb is the complete set of directories that are created for a
newly-initdb'd database, no? Surely there can't be a database that
contains vital directories that are not created there? See "subdirs"
static in initdb.c.
Easy: all you need is to suppose that some of the plain files at top level
of $PGDATA ought to be fsync'd. (I'm fairly sure for example that we took
steps awhile back to force postmaster.pid to be fsync'd.) If there is a
distinction between the fsync requirements of top-level files and
everything else, it is completely accidental and not to be relied on.
And from the other direction, where exactly is it written that
distros/users will only create problematic files at the top level of
$PGDATA? I'd have zero confidence in such an assertion applied to
tablespace directories, for sure.
regards, tom lane