BUG #7624: Misleading Log Message & Inconsistent Configuration Design
The following bug has been logged on the website:
Bug reference: 7624
Logged by: Tianyin Xu
Email address: tixu@cs.ucsd.edu
PostgreSQL version: 9.2.1
Operating system: Linux (actually doesn't matter)
Description:
Hi, Postgresql,
I'm here to report a very misleading log message and inconsistent
configuration design. It confuses users, unnecessarily complicating
configuration error diagnosis and causing frustration. As a computer system
grad student familiar with C code and GDB, it took me ~1.8 hours to figure
out the problem.
--------------------------
So, in the postgresql.conf, I had the following entries:
data_directory = 'local/pgsql/data/'
hba_file = 'local/pgsql/data/pg_hba.conf'
ident_file = 'local/pgsql/data/pg_ident.conf'
unix_socket_directory = 'local/pgsql/data/'
Without the “unix_socket_directory” setting, when I started the server,
everything was fine.
However, with the “unix_socket_directory” setting, the server refused to
start with the following message:
FATAL: could not create lock file "local/pgsql/data/.s.PGSQL.5432.lock": No
such file or directory
Since the sever definitely could find “data_directory”, “hba_file”, and
“ident_file”, the “local/pgsql/data” was definitely considered as
***valid*** path.
The weird thing to me is that typing “#touch
local/pgsql/data/.s.PGSQL.5432.lock” could generate the file, and even I
generated the file, the server still complained with the same log message.
Then, I used GDB to trace the error point and found the message was printed
because of the failure of “fopen”,
/* src/backend/utils/init/miscinit.c */
static void
CreateLockFile(const char *filename, bool amPostmaster,
bool isDDLock, const char *refName)
{
….
fd = open(filename, O_RDWR | O_CREAT | O_EXCL, 0600);
if ((errno != EEXIST && errno != EACCES) || ntries > 100)
ereport(FATAL,
(errcode_for_file_access(),
errmsg("could not create lock file
\"%s\": %m",
filename)));
….
}
I copied this piece of code out into a test.c file, compiled, and ran. It
did generate the “local/pgsql/data/.s.PGSQL.5432.lock”!!!, which means the
function “open” definitely works.
Then, I print the errno and found it's 2, i.e., ENOENT:
O_CREAT is not set and the named file does not exist; or O_CREAT is set and
either the path prefix does not exist or the path argument points to an
empty string.
But I thought (I was stupid) this's bullshit. Because path was not an empty
string, and it definitely existed! Otherwise how could “data_directory”,
“hba_file”, “ident_file” and the other similar file/directory paths work?
No idea...
I installed the newest postgres-9.2.1 again, but nothing changed. I
suspected it's because of some weird settings on my machine, so I installed
postgres again on a new server machine, but still nothing changed!
No idea...
Found a senior student and explained this problem. He tried the same thing
as I did and also had no idea.
What the hell...
Go back to the code, trace more carefully (patient, patient, patient), and
finally found the problem!!!
On the execution path, the server first called:
/*src/backend/postmaster/postmaster.c*/
757 ChangeToDataDir();
And then, called
/*src/backend/utils/init/miscinit.c*/
965 CreateLockFile(lockfile, amPostmaster, false, socketfile);
So, when it called open(1), the working directory is changed to the data
directory!
Actually the errno makes sense. I was stupid!!
-----------------------------
But how can users know this without reading the code... The only interface
exposed to users are the configuration file which is not consistent at all.
And the log message does not help but mislead users (the directory **does**
exist...).
I really suggest to make the configuration file consistent in this case. But
I understand it might not be easy. But at least I think we should print a
better log message which pinpoints to the absolute path like
FATAL: 58P01: could not create lock file
"/home/tianyin/postgresql-9.2.1/local/pgsql/data/local/pgsql/data/.s.PGSQL.5432.lock":
No such file or directory
In this case, the users can definitely figure out the problem by themselves.
Here is the patch:
diff --git a/miscinit_old.c b/miscinit.c
index fb376a0..8df83a8 100644
--- a/miscinit_old.c
+++ b/miscinit.c
@@ -736,11 +736,13 @@ CreateLockFile(const char *filename, bool
amPostmaster,
/*
* Couldn't create the pid file. Probably it already
exists.
*/
- if ((errno != EEXIST && errno != EACCES) || ntries > 100)
+ if ((errno != EEXIST && errno != EACCES) || ntries > 100) {
+ char* abs_filename = make_absolute_path(filename);
ereport(FATAL,
(errcode_for_file_access(),
errmsg("could not create lock file
\"%s\": %m",
- filename)));
+ abs_filename)));
+ }
/*
* Read the file to get the old owner's PID. Note race
condition
Thanks a lot!
Tianyin
On 10/28/2012 03:35 PM, tixu@cs.ucsd.edu wrote:
I really suggest to make the configuration file consistent in this case. But
I understand it might not be easy. But at least I think we should print a
better log message which pinpoints to the absolute path likeFATAL: 58P01: could not create lock file
"/home/tianyin/postgresql-9.2.1/local/pgsql/data/local/pgsql/data/.s.PGSQL.5432.lock":
No such file or directory
Essentially, you want PostgreSQL to print absolute paths in error
messages, instead of paths relative to the server datadir?
I'd prefer to do it slightly differently, but I like the general idea.
I'd instead want to write:
FATAL: 58P01: could not create lock file
"local/pgsql/data/.s.PGSQL.5432.lock"
(cwd="/home/tianyin/postgresql-9.2.1/local/pgsql/data/"): No such file
or directory
By spelling out the CWD explicitly there's no impression given that the
user ever actually specified the whole path in the configuration
anywhere, so it's clearer how the path came to be. It also shows which
part of the path is known-good (since it's the CWD) and which could be
at issue.
--
Craig Ringer
Hi, Craig,
Thanks a lot for your response. Sorry to make the bug report so long... I
was a little pissed off by myself for my stupidness after spending around 2
hours on this issue.
Yes, your message is definitely better and explicit by pointing out CWD.
The only concern to me is whether or not CWD is a common abbreviation,
especially for non-native speakers. Searching "CWD" on google, I didn't
find it refers to Current Working Directory in the first 3 returned pages.
To me, printing an explicit log message is the easiest way to solve the
problem. But the configuration file design is still not consistent (maybe
I'm too picky). In the following settings, we have to tell users that the
first 3 works and the last 1 does not work because of chdir blah blah blah.
data_directory = 'local/pgsql/data/'
hba_file = 'local/pgsql/data/pg_hba.conf'
ident_file = 'local/pgsql/data/pg_ident.conf'
unix_socket_directory = 'local/pgsql/data/'
Thanks,
Tianyin
On Mon, Oct 29, 2012 at 7:46 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:
On 10/28/2012 03:35 PM, tixu@cs.ucsd.edu wrote:
I really suggest to make the configuration file consistent in this case.
But
I understand it might not be easy. But at least I think we should print a
better log message which pinpoints to the absolute path likeFATAL: 58P01: could not create lock file
"/home/tianyin/postgresql-9.2.1/local/pgsql/data/local/pgsql/data/.s.PGSQL.5432.lock":
No such file or directory
Essentially, you want PostgreSQL to print absolute paths in error
messages, instead of paths relative to the server datadir?I'd prefer to do it slightly differently, but I like the general idea.
I'd instead want to write:FATAL: 58P01: could not create lock file
"local/pgsql/data/.s.PGSQL.5432.lock"
(cwd="/home/tianyin/postgresql-9.2.1/local/pgsql/data/"): No such file
or directoryBy spelling out the CWD explicitly there's no impression given that the
user ever actually specified the whole path in the configuration
anywhere, so it's clearer how the path came to be. It also shows which
part of the path is known-good (since it's the CWD) and which could be
at issue.--
Craig Ringer
--
Tianyin XU,
http://cseweb.ucsd.edu/~tixu/