pg_rewind failure by file deletion in source server

Started by Fujii Masao · almost 11 years ago · 38 messages · pgsql-hackers
#1 Fujii Masao
masao.fujii@gmail.com

Hi,

While testing pg_rewind, I got the following error and pg_rewind failed.

$ pg_rewind -D ... --source-server="..." -P
ERROR: could not open file "base/13243/16384" for reading: No
such file or directory
STATEMENT: SELECT path, begin,
pg_read_binary_file(path, begin, len) AS chunk
FROM fetchchunks

As far as I read the pg_rewind code, ISTM that the file deletion in
source server while pg_rewind is running can cause pg_rewind to fail.
That is, at first pg_rewind picks up the files to copy (or do some actions)
and creates the file map. Then it performs the actual operation (e.g.,
file copy from source to dest) according to the file map. The problem
can happen if the source server deletes the file listed in the file map
before pg_rewind performs the actual operations. The copy of the file
must fail because it's not found in source server, and then pg_rewind
exits with an error.

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?
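For illustration, the two-phase behavior described above, and the proposed skip, can be sketched in Python. This is a toy model of the logic only, not pg_rewind's actual C code; all names here are invented:

```python
import os
import shutil

def build_file_map(source_dir):
    """Phase 1: snapshot the list of source files to copy (the 'file map')."""
    return sorted(
        os.path.join(root, f)
        for root, _, files in os.walk(source_dir)
        for f in files
    )

def copy_from_map(file_map, source_dir, dest_dir, skip_missing=True):
    """Phase 2: copy each mapped file; the source may have changed meanwhile."""
    copied, skipped = [], []
    for path in file_map:
        rel = os.path.relpath(path, source_dir)
        dest = os.path.join(dest_dir, rel)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        try:
            shutil.copyfile(path, dest)
            copied.append(rel)
        except FileNotFoundError:
            if not skip_missing:
                raise
            # The file vanished on the source after the map was built:
            # skip it, since it then need not exist on the destination either.
            skipped.append(rel)
    return copied, skipped
```

A file deleted between phase 1 and phase 2 lands in `skipped` instead of aborting the whole run, which is the behavior proposed here.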

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#1)
Re: pg_rewind failure by file deletion in source server

On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?

I think that you should fail. Let's imagine that the master to be
rewound has removed a relation file before being stopped cleanly after
its standby has been promoted that was here at the last checkpoint
before forking, and that the standby still has the relation file after
promotion. You should be able to copy it to be able to replay WAL on
it. If the standby has removed a file in the file map after taking the
file map, I guess that the best thing to do is fail because the file
that should be here for the rewound node cannot be fetched.
Documentation should be made clearer about that with a better error
message...
--
Michael


#3 Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#2)
Re: pg_rewind failure by file deletion in source server

On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?

I think that you should fail. Let's imagine that the master to be
rewound has removed a relation file before being stopped cleanly after
its standby has been promoted that was here at the last checkpoint
before forking, and that the standby still has the relation file after
promotion. You should be able to copy it to be able to replay WAL on
it. If the standby has removed a file in the file map after taking the
file map, I guess that the best thing to do is fail because the file
that should be here for the rewound node cannot be fetched.

In this case, why do you think that the file should exist in the old master?
Even if it doesn't exist, ISTM that the old master can safely replay the WAL
records related to the file when it restarts. So what's the problem
if the file doesn't exist in the old master?

Documentation should be made clearer about that with a better error
message...

I'm wondering how we can recover (or rewind again) the old master from
that error. This also would need to be documented if we decide not to
fix any code regarding this problem...

Regards,

--
Fujii Masao


#4 Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#3)
Re: pg_rewind failure by file deletion in source server

On Thu, Jun 11, 2015 at 5:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?

I think that you should fail. Let's imagine that the master to be
rewound has removed a relation file before being stopped cleanly after
its standby has been promoted that was here at the last checkpoint
before forking, and that the standby still has the relation file after
promotion. You should be able to copy it to be able to replay WAL on
it. If the standby has removed a file in the file map after taking the
file map, I guess that the best thing to do is fail because the file
that should be here for the rewound node cannot be fetched.

In this case, why do you think that the file should exist in the old master?
Even if it doesn't exist, ISTM that the old master can safely replay the WAL
records related to the file when it restarts. So what's the problem
if the file doesn't exist in the old master?

Well, some user may want to rewind the master down to the point where
WAL forked, and then recover it immediately when a consistent point is
reached just at restart instead of replugging it into the cluster. In
this case I think that you need the relation file of the dropped
relation to get a consistent state. That's still cheaper than
recreating a node from a fresh base backup in some cases, particularly
if the last base backup taken is far in the past for this cluster.

Documentation should be made clearer about that with a better error
message...

I'm wondering how we can recover (or rewind again) the old master from
that error. This also would need to be documented if we decide not to
fix any code regarding this problem...

FWIW, here is a scenario able to trigger the error with 1 master (port
5432, data at ~/data/5432) and 1 standby (port 5433, data at
~/data/5433).
$ psql -c 'create table aa as select generate_series(1,1000000)'
# Promote standby
$ pg_ctl promote -D ~/data/5433/
# Drop table on master
$ psql -c 'drop table aa'
DROP TABLE
$ pg_ctl stop -D ~/data/5432/

At this point there is no more relation file on the master for 'aa',
but it is still present on the standby. Running pg_rewind now will
work; the relation file will be copied from the promoted standby to
the master.

$ lldb -- pg_rewind -D 5432 --source-server="port=5433 dbname=postgres"
Set a breakpoint in pg_rewind after fetchSourceFileList() and before
the changes from the block map are replayed, then drop table 'aa' on
the standby and checkpoint it; the source file list is now
inconsistent and pg_rewind will fail. This can only happen with
--source-server, not with --source-pgdata.

Adding a sleep() of a couple of seconds in pg_rewind may make it
easier to trigger the error directly ;), with DROP DATABASE for
example.

Regards,
--
Michael


#5 Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#4)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 12, 2015 at 3:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 5:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?

I think that you should fail. Let's imagine that the master to be
rewound has removed a relation file before being stopped cleanly after
its standby has been promoted that was here at the last checkpoint
before forking, and that the standby still has the relation file after
promotion. You should be able to copy it to be able to replay WAL on
it. If the standby has removed a file in the file map after taking the
file map, I guess that the best thing to do is fail because the file
that should be here for the rewound node cannot be fetched.

In this case, why do you think that the file should exist in the old master?
Even if it doesn't exist, ISTM that the old master can safely replay the WAL
records related to the file when it restarts. So what's the problem
if the file doesn't exist in the old master?

Well, some user may want to rewind the master down to the point where
WAL forked, and then recover it immediately when a consistent point is
reached just at restart instead of replugging it into the cluster. In
this case I think that you need the relation file of the dropped
relation to get a consistent state. That's still cheaper than
recreating a node from a fresh base backup in some cases, particularly
if the last base backup taken is far in the past for this cluster.

So it's the case where a user wants to recover old master up to the point
BEFORE the file in question is deleted in new master. At that point,
since the file must exist, pg_rewind should fail if the file cannot be copied
from new master. Is my understanding right?

As far as I read the code of pg_rewind, ISTM that your scenario never happens.
Because pg_rewind sets the minimum recovery point to the latest WAL location
in new master, i.e., AFTER the file is deleted. So old master cannot stop
recovering before the file is deleted in new master. If the recovery stops
at that point, it fails because the minimum recovery point is not reached yet.

IOW, after pg_rewind runs, the old master has to replay the WAL records
which were generated by the deletion of the file in the new master.
So it's okay if the old master doesn't have the file after pg_rewind runs,
I think.
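This argument can be made concrete with a toy replay model, purely illustrative; the record format and function name here are invented:

```python
def replay(files, wal, min_recovery_point):
    """Apply WAL records in LSN order; a rewound node is only consistent
    once replay has reached min_recovery_point (a stand-in for the
    control-file field pg_rewind sets to the source's latest insert
    location, i.e. AFTER the file deletion was logged)."""
    last_lsn = 0
    for lsn, op, name in wal:
        if op == "create":
            files.add(name)
        elif op == "drop":
            files.discard(name)  # dropping an already-missing file is a no-op
        last_lsn = lsn
    if last_lsn < min_recovery_point:
        raise RuntimeError("minimum recovery point not reached")
    return files
```

Even if the rewound node starts without the deleted file, it must replay through the drop record before it is consistent, so it ends up in the same state as the new master either way.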

Regards,

--
Fujii Masao


#6 Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#5)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 12, 2015 at 3:50 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Jun 12, 2015 at 3:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 5:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?

I think that you should fail. Let's imagine that the master to be
rewound has removed a relation file before being stopped cleanly after
its standby has been promoted that was here at the last checkpoint
before forking, and that the standby still has the relation file after
promotion. You should be able to copy it to be able to replay WAL on
it. If the standby has removed a file in the file map after taking the
file map, I guess that the best thing to do is fail because the file
that should be here for the rewound node cannot be fetched.

In this case, why do you think that the file should exist in the old master?
Even if it doesn't exist, ISTM that the old master can safely replay the WAL
records related to the file when it restarts. So what's the problem
if the file doesn't exist in the old master?

Well, some user may want to rewind the master down to the point where
WAL forked, and then recover it immediately when a consistent point is
reached just at restart instead of replugging it into the cluster. In
this case I think that you need the relation file of the dropped
relation to get a consistent state. That's still cheaper than
recreating a node from a fresh base backup in some cases, particularly
if the last base backup taken is far in the past for this cluster.

So it's the case where a user wants to recover old master up to the point
BEFORE the file in question is deleted in new master. At that point,
since the file must exist, pg_rewind should fail if the file cannot be copied
from new master. Is my understanding right?

Yep. We are on the same line.

As far as I read the code of pg_rewind, ISTM that your scenario never happens.
Because pg_rewind sets the minimum recovery point to the latest WAL location
in new master, i.e., AFTER the file is deleted. So old master cannot stop
recovering before the file is deleted in new master. If the recovery stops
at that point, it fails because the minimum recovery point is not reached yet.

IOW, after pg_rewind runs, the old master has to replay the WAL records
which were generated by the deletion of the file in the new master.
So it's okay if the old master doesn't have the file after pg_rewind runs,
I think.

Ah, right. I withdraw, indeed what I thought can not happen:
/*
 * Update control file of target. Make it ready to perform archive
 * recovery when restarting.
 *
 * minRecoveryPoint is set to the current WAL insert location in the
 * source server. Like in an online backup, it's important that we
 * recover all the WAL that was generated while we copied the files over.
 */
So a rewound node will replay WAL up to the current insert location of
the source, and will fail at recovery if recovery target is older than
this insert location..

You want to draft a patch? Should I? I think that we should have a
test case as well in pg_rewind/t/.
--
Michael


#7 Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#6)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 12, 2015 at 4:29 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Jun 12, 2015 at 3:50 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Jun 12, 2015 at 3:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 5:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?

I think that you should fail. Let's imagine that the master to be
rewound has removed a relation file before being stopped cleanly after
its standby has been promoted that was here at the last checkpoint
before forking, and that the standby still has the relation file after
promotion. You should be able to copy it to be able to replay WAL on
it. If the standby has removed a file in the file map after taking the
file map, I guess that the best thing to do is fail because the file
that should be here for the rewound node cannot be fetched.

In this case, why do you think that the file should exist in the old master?
Even if it doesn't exist, ISTM that the old master can safely replay the WAL
records related to the file when it restarts. So what's the problem
if the file doesn't exist in the old master?

Well, some user may want to rewind the master down to the point where
WAL forked, and then recover it immediately when a consistent point is
reached just at restart instead of replugging it into the cluster. In
this case I think that you need the relation file of the dropped
relation to get a consistent state. That's still cheaper than
recreating a node from a fresh base backup in some cases, particularly
if the last base backup taken is far in the past for this cluster.

So it's the case where a user wants to recover old master up to the point
BEFORE the file in question is deleted in new master. At that point,
since the file must exist, pg_rewind should fail if the file cannot be copied
from new master. Is my understanding right?

Yep. We are on the same line.

As far as I read the code of pg_rewind, ISTM that your scenario never happens.
Because pg_rewind sets the minimum recovery point to the latest WAL location
in new master, i.e., AFTER the file is deleted. So old master cannot stop
recovering before the file is deleted in new master. If the recovery stops
at that point, it fails because the minimum recovery point is not reached yet.

IOW, after pg_rewind runs, the old master has to replay the WAL records
which were generated by the deletion of the file in the new master.
So it's okay if the old master doesn't have the file after pg_rewind runs,
I think.

Ah, right. I withdraw, indeed what I thought can not happen:
/*
 * Update control file of target. Make it ready to perform archive
 * recovery when restarting.
 *
 * minRecoveryPoint is set to the current WAL insert location in the
 * source server. Like in an online backup, it's important that we
 * recover all the WAL that was generated while we copied the files over.
 */
So a rewound node will replay WAL up to the current insert location of
the source, and will fail at recovery if recovery target is older than
this insert location..

You want to draft a patch? Should I?

Please feel free to try that! :)

I think that we should have a
test case as well in pg_rewind/t/.

Maybe.

Regards,

--
Fujii Masao


#8 Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#7)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 12, 2015 at 9:02 PM, Fujii Masao wrote:

You want to draft a patch? Should I?

Please feel free to try that! :)

OK, so attached are a patch and a test case that trigger the error
easily. Apply the patch and run the test case to reproduce the
following failure:
ERROR: could not open file "base/16384/16385_fsm" for reading: No
such file or directory
STATEMENT: SELECT path, begin,
pg_read_binary_file(path, begin, len) AS chunk
FROM fetchchunks
The patch adds a call to pg_usleep in pg_rewind.c, after the list of
files has been fetched from the source server with libpq, to leave
time to run some DROP actions (DROP DATABASE, DROP TABLE, etc.) in
order to trigger the error easily.

In order to reduce the risk of failure to a minimum and to preserve
the performance of the tool when using --source-server, I think that
we should add some check using pg_stat_file to see if a file is still
present or not, and if it is missing we can safely skip it thanks to
minRecoveryPoint. Now the problem is that pg_stat_file fails
automatically if the file targeted is missing. Hence, to avoid a bunch
of round trips with the server with one call to pg_stat_file per file,
I think that we should add some if_not_exists option to pg_stat_file,
defaulting to false, to skip the error related to the file missing and
have it return NULL in this case. Then we could use this filter on the
file path in libpq_executeFileMap() to fetch only the file chunks that
actually exist on the server. Note that we could as well use some
plpgsql-ing to do the same, but the extension of pg_stat_file looks
more useful to me. Thoughts?
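The proposed pg_stat_file semantics might look like the following Python analogue. This is a sketch of the intended behavior only; the real change would be to the backend SQL function:

```python
import os

def stat_file(path, if_not_exists=False):
    """Return (size, mtime) for path.  With if_not_exists=True, a missing
    file yields None instead of an error, mirroring the proposed option."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        if if_not_exists:
            return None
        raise
    return (st.st_size, st.st_mtime)
```

The caller can then filter out files that returned None instead of aborting on the first missing one.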
--
Michael

Attachments:

rewind_test.bash (application/octet-stream)
20150616_pgrewind_sleep.patch (+6 -0)
#9 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#8)
Re: pg_rewind failure by file deletion in source server

On 06/16/2015 02:04 AM, Michael Paquier wrote:

In order to reduce the risk of failure to a minimum and to preserve
the performance of the tool when using --source-server, I think that
we should add some check using pg_stat_file to see if a file is still
present or not, and if it is missing we can safely skip it thanks to
minRecoveryPoint. Now the problem is that pg_stat_file fails
automatically if the file targeted is missing. Hence, to avoid a bunch
of round trips with the server with one call to pg_stat_file per file,
I think that we should add some if_not_exists option to pg_stat_file,
defaulting to false, to skip the error related to the file missing and
have it return NULL in this case. Then we could use this filter on the
file path in libpq_executeFileMap() to fetch only the file chunks that
actually exist on the server.

You'll also need to add the option to pg_read_binary_file, though,
because even if you do a test with pg_stat_file() just before reading
the file, there's a race condition: someone might still delete file
between pg_stat_file() and pg_read_file().

Listing the directories with pg_ls_dir() has the same problem. As does
pg_tablespace_location().
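The race Heikki describes is a classic check-then-use (TOCTOU) problem; in Python terms (illustrative only, not backend code):

```python
import os

def read_checked(path):
    """Racy: the file can be deleted between the check and the open."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:      # may still raise FileNotFoundError
        return f.read()

def read_atomic(path):
    """Race-free: just try the open and treat ENOENT as 'file is gone'."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None
```

Hence the read function itself needs the option; a prior existence check cannot close the window.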

Note that we could as well use some plpgsql-ing to do the same, but
the extension of pg_stat_file looks more useful to me. Thoughts?

Hmm. You'll need to add the option to all of those functions. Maybe it's
nevertheless the simplest approach.

- Heikki


#10 Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#9)
Re: pg_rewind failure by file deletion in source server

On Thu, Jun 18, 2015 at 10:55 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/16/2015 02:04 AM, Michael Paquier wrote:

In order to reduce the risk of failure to a minimum and to preserve
the performance of the tool when using --source-server, I think that
we should add some check using pg_stat_file to see if a file is still
present or not, and if it is missing we can safely skip it thanks to
minRecoveryPoint. Now the problem is that pg_stat_file fails
automatically if the file targeted is missing. Hence, to avoid a bunch
of round trips with the server with one call to pg_stat_file per file,
I think that we should add some if_not_exists option to pg_stat_file,
defaulting to false, to skip the error related to the file missing and
have it return NULL in this case. Then we could use this filter on the
file path in libpq_executeFileMap() to fetch only the file chunks that
actually exist on the server.

You'll also need to add the option to pg_read_binary_file, though, because
even if you do a test with pg_stat_file() just before reading the file,
there's a race condition: someone might still delete file between
pg_stat_file() and pg_read_file().

I propose to return NULL values if the file does not exist and
if_not_exists = true for both of them. Does that sound fine?

Listing the directories with pg_ls_dir() has the same problem.

(After some discussion on IM with Heikki on this one.)
This is actually more tricky because pg_ls_dir() does not return '.'
or '..', which we could otherwise use to tell whether the directory
actually exists when it is empty. Hence I think that we should add two
options to pg_ls_dir:
- include_self, defaulting to false. If set to true, '.' is added to the
list of items.
- if_not_exists, to bypass the error when a folder does not exist,
defaulting to false. If if_not_exists = true and include_self = true,
returning only '.' would mean that the folder exists but is empty. If
if_not_exists = true and include_self = false, no rows are returned.
We could also just ERROR if both options are set that way. I am
fine with any of these as long as the behavior is properly documented.

As does pg_tablespace_location().

NULL if the tablespace path does not exist anymore. Is that fine?

Note that we could as well use some plpgsql-ing to do the same, but
the extension of pg_stat_file looks more useful to me. Thoughts?

Hmm. You'll need to add the option to all of those functions. Maybe it's
nevertheless the simplest approach.

With plpgsql you could use a try/catch/raising block to do the work.
But it still looks better to me to have alternative options with the
in-core functions.

I am fine to spend time on all those things and provide test cases,
let's just get a precise picture of what we want first.
Regards,
--
Michael


#11 Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#10)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 19, 2015 at 12:14 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Listing the directories with pg_ls_dir() has the same problem.

(After some discussion on IM with Heikki on this one.)
This is actually more tricky because pg_ls_dir() does not return '.'
or '..', which we could otherwise use to tell whether the directory
actually exists when it is empty. Hence I think that we should add two
options to pg_ls_dir:
- include_self, defaulting to false. If set to true, '.' is added to the
list of items.
- if_not_exists, to bypass the error when a folder does not exist,
defaulting to false. If if_not_exists = true and include_self = true,
returning only '.' would mean that the folder exists but is empty. If
if_not_exists = true and include_self = false, no rows are returned.
We could also just ERROR if both options are set that way. I am
fine with any of these as long as the behavior is properly documented.

Including '.' to distinguish between an empty directory and a
nonexistent one seems like an unnecessarily complicated and
non-obvious API. How about just one additional parameter bool
*exists. If NULL and no directory, ERROR, else on return set *exists
to true or false.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#12 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#11)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 19, 2015 at 8:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 19, 2015 at 12:14 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Listing the directories with pg_ls_dir() has the same problem.

(After some discussion on IM with Heikki on this one.)
This is actually more tricky because pg_ls_dir() does not return '.'
or '..', which we could otherwise use to tell whether the directory
actually exists when it is empty. Hence I think that we should add two
options to pg_ls_dir:
- include_self, defaulting to false. If set to true, '.' is added to the
list of items.
- if_not_exists, to bypass the error when a folder does not exist,
defaulting to false. If if_not_exists = true and include_self = true,
returning only '.' would mean that the folder exists but is empty. If
if_not_exists = true and include_self = false, no rows are returned.
We could also just ERROR if both options are set that way. I am
fine with any of these as long as the behavior is properly documented.

Including '.' to distinguish between an empty directory and a
nonexistent one seems like an unnecessarily complicated and
non-obvious API. How about just one additional parameter bool
*exists. If NULL and no directory, ERROR, else on return set *exists
to true or false.

Err, wait. You're talking about an SQL function, heh heh. So that
won't work. Maybe what you proposed is the best we can do, then,
although I would suggest that you rename the include_self parameter to
something like include_dot_dirs and return both "." and "..".
Returning only "." seems like it will seem weird to people.
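The resulting semantics, with Robert's include_dot_dirs naming, could be modeled like so. This is a Python sketch of the intended SQL-level behavior, not actual backend code:

```python
import os

def ls_dir(path, include_dot_dirs=False, if_not_exists=False):
    """List a directory's entries.  With include_dot_dirs, '.' and '..' are
    prepended, so an empty-but-existing directory returns two rows while a
    missing one (with if_not_exists=True) returns none."""
    try:
        entries = sorted(os.listdir(path))
    except FileNotFoundError:
        if if_not_exists:
            return []        # no rows at all: the directory does not exist
        raise
    if include_dot_dirs:
        entries = [".", ".."] + entries
    return entries
```

With both options set, "some rows" vs "no rows" cleanly distinguishes empty from nonexistent.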

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#13 Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#12)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 19, 2015 at 9:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 19, 2015 at 8:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 19, 2015 at 12:14 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Listing the directories with pg_ls_dir() has the same problem.

(After some discussion on IM with Heikki on this one.)
This is actually more tricky because pg_ls_dir() does not return '.'
or '..', which we could otherwise use to tell whether the directory
actually exists when it is empty. Hence I think that we should add two
options to pg_ls_dir:
- include_self, defaulting to false. If set to true, '.' is added to the
list of items.
- if_not_exists, to bypass the error when a folder does not exist,
defaulting to false. If if_not_exists = true and include_self = true,
returning only '.' would mean that the folder exists but is empty. If
if_not_exists = true and include_self = false, no rows are returned.
We could also just ERROR if both options are set that way. I am
fine with any of these as long as the behavior is properly documented.

Including '.' to distinguish between an empty directory and a
nonexistent one seems like an unnecessarily complicated and
non-obvious API. How about just one additional parameter bool
*exists. If NULL and no directory, ERROR, else on return set *exists
to true or false.

Err, wait. You're talking about an SQL function, heh heh. So that
won't work. Maybe what you proposed is the best we can do, then,
although I would suggest that you rename the include_self parameter to
something like include_dot_dirs and return both "." and "..".
Returning only "." seems like it will seem weird to people.

So... Attached are a set of patches dedicated at fixing this issue:
- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

Attached is an updated test case triggering the issue
(rewind_test.bash), with the small patch attached that adds a pg_sleep
call in pg_rewind.c (20150623_pg_rewind_sleep.patch).

I imagine that this is a bug people are going to meet in the field
easily, particularly with temporary relation files or temporary XLOG
files.
Regards,
--
Michael

Attachments:

0006-Make-pg_rewind-able-to-detect-deleted-files-on-remot.patch (+30 -12)
0001-Extend-pg_tablespace_location-with-if_not_exists-opt.patch (+57 -15)
0002-Extend-pg_stat_file-with-if_not_exists-option.patch (+59 -19)
0003-Add-IF-NOT-EXISTS-to-pg_read_file-and-pg_read_binary.patch (+179 -30)
0004-Extend-pg_ls_dir-with-include_dot_dirs-and-if_not_ex.patch (+59 -14)
0005-Add-regression-tests-for-pg_ls_dir-and-pg_read_-bina.patch (+116 -1)
20150623_pg_rewind_sleep.patch (+6 -0)
rewind_test.bash (application/octet-stream)
#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#13)
Re: pg_rewind failure by file deletion in source server

On 06/23/2015 07:51 AM, Michael Paquier wrote:

So... Attached are a set of patches dedicated at fixing this issue:

Thanks for working on this!

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With these patches, pg_read_file() will return NULL for any failure to
open the file, which makes pg_rewind assume that the file doesn't
exist in the source server, and it will remove the file from the
destination. That's dangerous; those functions should check specifically
for ENOENT.
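Heikki's point, tolerating only ENOENT, can be sketched in plain C (a hypothetical helper, not pg_rewind's actual code): a missing file is reported to the caller so it can be skipped, while any other errno remains a hard error rather than being conflated with "file deleted":

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of ENOENT-specific handling: only a missing file
 * is tolerated (the source may have deleted it concurrently); any other
 * failure (EACCES, EIO, ...) is reported so it cannot be mistaken for
 * a deleted file.
 */
static FILE *
open_source_file(const char *path, bool *missing)
{
	FILE	   *fp = fopen(path, "rb");

	*missing = false;
	if (fp == NULL)
	{
		if (errno == ENOENT)
			*missing = true;	/* concurrently removed: safe to skip */
		else
			fprintf(stderr, "could not open \"%s\": %s\n",
					path, strerror(errno));
	}
	return fp;
}
```

The caller can then skip the file only when *missing is set, instead of on every NULL result.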

There's still a small race condition with tablespaces. If you run CREATE
TABLESPACE in the source server while pg_rewind is running, it's
possible that the recursive query that pg_rewind uses sees the symlink
in pg_tblspc/ directory, but its snapshot doesn't see the row in
pg_tablespace yet. It will think that the symlink is a regular file, try
to read it, and fail (if we checked for ENOENT).

Actually, I think we need to try to deal with symlinks a bit harder.
Currently, pg_rewind assumes that anything in pg_tblspc/ that has a
matching row in pg_tablespace is a symlink, and nothing else is. But
other things can be symlinks to directories too, most notably pg_xlog.
I just noticed that pg_rewind fails miserably if pg_xlog is a symlink,
because of that:

----
The servers diverged at WAL position 0/3023F08 on timeline 1.
Rewinding from last common checkpoint at 0/2000060 on timeline 1

"data-master//pg_xlog" is not a directory
Failure, exiting
----

I think we need to add a column to pg_stat_file output, to indicate
symbolic links, and add a pg_readlink() function. That still leaves a
race condition if the type of a file changes, i.e. a file is deleted and
a directory with the same name is created in its place, but that seems
acceptable. I don't think PostgreSQL ever does such a thing, so that
could only happen if you mess with the data directory manually while the
server is running.

I just realized another problem: We recently learned the hard way that
some people have files in the data directory that are not writeable by
the 'postgres' user
(/messages/by-id/20150523172627.GA24277@msg.df7cb.de).
pg_rewind will try to overwrite all files it doesn't recognize as
relation files, so it's going to fail on those. A straightforward fix
would be to first open the destination file in read-only mode, and
compare its contents, and only open the file in write mode if it has
changed. It would still fail when the files really differ, but I think
that's acceptable.
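The straightforward fix described above can be sketched as a local-file analogue (hypothetical helper; a real implementation would loop over the whole file instead of assuming it fits in one buffer):

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

/*
 * Write "len" bytes to "path" only if the current contents differ.
 * Opening read-only first means a file that is unwritable by the
 * 'postgres' user but already identical never triggers a write, and
 * therefore never fails on permissions.
 *
 * Sketch only: assumes the file fits in the comparison buffer.
 */
static bool
write_if_changed(const char *path, const char *data, size_t len)
{
	char		buf[8192];
	FILE	   *fp = fopen(path, "rb");
	bool		same = false;

	if (fp != NULL)
	{
		size_t		got = fread(buf, 1, sizeof(buf), fp);

		fclose(fp);
		same = (got == len && memcmp(buf, data, len) == 0);
	}
	if (same)
		return false;			/* identical: never opened for writing */

	fp = fopen(path, "wb");		/* write access needed only when it differs */
	if (fp != NULL)
	{
		fwrite(data, 1, len, fp);
		fclose(fp);
	}
	return true;
}
```

As noted, this still fails when the files really differ and the destination is unwritable, which seems an acceptable trade-off.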

I note that pg_rewind doesn't need to distinguish between an empty and a
non-existent directory, so it's quite silly for it to pass
include_dot_dirs=true, and then filter out "." and ".." from the result
set. The documentation should mention the main reason for including "."
and "..": to distinguish between an empty and non-existent directory.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#14)
Re: pg_rewind failure by file deletion in source server

On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/23/2015 07:51 AM, Michael Paquier wrote:

So... Attached are a set of patches dedicated at fixing this issue:

Thanks for working on this!

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With these patches, pg_read_file() will return NULL for any failure to open
the file, which makes pg_rewind to assume that the file doesn't exist in the
source server, and will remove the file from the destination. That's
dangerous, those functions should check specifically for ENOENT.

I'm wondering if using pg_read_file() to copy the file from source server
is reasonable. ISTM that it has two problems as follows.

1. It cannot read a very large file, like a 1GB one. So if such a large
file was created in the source server after failover, pg_rewind would
not be able to copy the file. No?

2. Many users may not allow a remote client to connect to the
PostgreSQL server as a superuser for some security reasons. IOW,
there would be no entry in pg_hba.conf for such a connection.
In this case, pg_rewind always fails because pg_read_file() needs
superuser privilege. No?

I'm tempted to implement the replication command version of
pg_read_file(). That is, it would read and send the data like the
BASE_BACKUP replication command does...

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#15)
Re: pg_rewind failure by file deletion in source server

On 06/23/2015 05:03 PM, Fujii Masao wrote:

On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/23/2015 07:51 AM, Michael Paquier wrote:

So... Attached are a set of patches dedicated at fixing this issue:

Thanks for working on this!

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With these patches, pg_read_file() will return NULL for any failure to
the file, which makes pg_rewind to assume that the file doesn't exist in the
source server, and will remove the file from the destination. That's
dangerous, those functions should check specifically for ENOENT.

I'm wondering if using pg_read_file() to copy the file from source server
is reasonable. ISTM that it has two problems as follows.

1. It cannot read very large file like 1GB file. So if such large file was
created in source server after failover, pg_rewind would not be able
to copy the file. No?

pg_read_binary_file() handles large files just fine. It cannot return
more than 1GB in one call, but you can call it several times and
retrieve the file in chunks. That's what pg_rewind does, except for
reading the control file, which is known to be small.
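The chunked retrieval described here is just an offset loop: each round trip issues something like pg_read_binary_file(path, offset, length). A local C analogue of that loop (hypothetical helper and chunk size, not pg_rewind's actual constants):

```c
#include <stdio.h>

#define CHUNK_SIZE (1024 * 1024)	/* hypothetical chunk size */

/*
 * Process a file of arbitrary size in fixed-size chunks, never holding
 * more than CHUNK_SIZE bytes at once; returns the total bytes seen, or
 * -1 if the file could not be opened. Each iteration corresponds to one
 * bounded pg_read_binary_file(path, offset, length) call.
 */
static long
process_in_chunks(const char *path)
{
	static char buf[CHUNK_SIZE];
	FILE	   *fp = fopen(path, "rb");
	long		total = 0;
	size_t		got;

	if (fp == NULL)
		return -1;
	while ((got = fread(buf, 1, sizeof(buf), fp)) > 0)
		total += (long) got;	/* here pg_rewind would ship the chunk */
	fclose(fp);
	return total;
}
```

This is why the 1GB-per-call limit is not a problem in practice: the per-call bound caps memory, not file size.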

2. Many users may not allow a remote client to connect to the
PostgreSQL server as a superuser for some security reasons. IOW,
there would be no entry in pg_hba.conf for such connection.
In this case, pg_rewind always fails because pg_read_file() needs
superuser privilege. No?

I'm tempted to implement the replication command version of
pg_read_file(). That is, it reads and sends the data like BASE_BACKUP
replication command does...

Yeah, that would definitely be nice. Peter suggested it back in January
(/messages/by-id/54AC4801.7050300@gmx.net). I think
it's way too late to do that for 9.5, however. I'm particularly worried
that if we design the required API in a rush, we're not going to get it
right, and will have to change it again soon. That might be difficult in
a minor release. Using pg_read_file() and friends is quite flexible,
even though we just found out that they're not quite flexible enough
right now (the ENOENT problem).

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#16)
Re: pg_rewind failure by file deletion in source server

On Tue, Jun 23, 2015 at 11:21 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/23/2015 05:03 PM, Fujii Masao wrote:

On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

On 06/23/2015 07:51 AM, Michael Paquier wrote:

So... Attached are a set of patches dedicated at fixing this issue:

Thanks for working on this!

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With these patches, pg_read_file() will return NULL for any failure to
open
the file, which makes pg_rewind to assume that the file doesn't exist in
the
source server, and will remove the file from the destination. That's
dangerous, those functions should check specifically for ENOENT.

I'm wondering if using pg_read_file() to copy the file from source server
is reasonable. ISTM that it has two problems as follows.

1. It cannot read very large file like 1GB file. So if such large file was
created in source server after failover, pg_rewind would not be able
to copy the file. No?

pg_read_binary_file() handles large files just fine. It cannot return more
than 1GB in one call, but you can call it several times and retrieve the
file in chunks. That's what pg_rewind does, except for reading the control
file, which is known to be small.

Yeah, you're right.

I found that pg_rewind creates a temporary table to fetch the file in chunks.
This would prevent pg_rewind from using the *hot standby* server as a source
server at all. This is of course a limitation of pg_rewind, but we might want
to alleviate it in the future.

2. Many users may not allow a remote client to connect to the
PostgreSQL server as a superuser for some security reasons. IOW,
there would be no entry in pg_hba.conf for such connection.
In this case, pg_rewind always fails because pg_read_file() needs
superuser privilege. No?

I'm tempted to implement the replication command version of
pg_read_file(). That is, it reads and sends the data like BASE_BACKUP
replication command does...

Yeah, that would definitely be nice. Peter suggested it back in January
(/messages/by-id/54AC4801.7050300@gmx.net). I think
it's way too late to do that for 9.5, however. I'm particularly worried that
if we design the required API in a rush, we're not going to get it right,
and will have to change it again soon. That might be difficult in a minor
release. Using pg_read_file() and friends is quite flexible, even though we
just find out that they're not quite flexible enough right now (the ENOENT
problem).

I agree that it's too late to do what I said...

But just using pg_read_file() cannot address the #2 problem that I
pointed out in my previous email. Also, requiring superuser privilege
for pg_rewind really conflicts with the motivation for adding the
replication privilege.

So should we change pg_read_file() so that even a replication user can
read the file?
Or should a replication-user version of pg_read_file() be implemented?

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#17)
Re: pg_rewind failure by file deletion in source server

On Wed, Jun 24, 2015 at 1:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Jun 23, 2015 at 11:21 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/23/2015 05:03 PM, Fujii Masao wrote:

On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

On 06/23/2015 07:51 AM, Michael Paquier wrote:

So... Attached are a set of patches dedicated at fixing this issue:

Thanks for working on this!

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With these patches, pg_read_file() will return NULL for any failure to
open
the file, which makes pg_rewind to assume that the file doesn't exist in
the
source server, and will remove the file from the destination. That's
dangerous, those functions should check specifically for ENOENT.

I'm wondering if using pg_read_file() to copy the file from source server
is reasonable. ISTM that it has two problems as follows.

1. It cannot read very large file like 1GB file. So if such large file was
created in source server after failover, pg_rewind would not be able
to copy the file. No?

pg_read_binary_file() handles large files just fine. It cannot return more
than 1GB in one call, but you can call it several times and retrieve the
file in chunks. That's what pg_rewind does, except for reading the control
file, which is known to be small.

Yeah, you're right.

I found that pg_rewind creates a temporary table to fetch the file in chunks.
This would prevent pg_rewind from using the *hot standby* server as a source
server at all. This is of course a limitation of pg_rewind, but we might want
to alleviate it in the future.

This is something that a replication command could address properly.

2. Many users may not allow a remote client to connect to the
PostgreSQL server as a superuser for some security reasons. IOW,
there would be no entry in pg_hba.conf for such connection.
In this case, pg_rewind always fails because pg_read_file() needs
superuser privilege. No?

I'm tempted to implement the replication command version of
pg_read_file(). That is, it reads and sends the data like BASE_BACKUP
replication command does...

Yeah, that would definitely be nice. Peter suggested it back in January
(/messages/by-id/54AC4801.7050300@gmx.net). I think
it's way too late to do that for 9.5, however. I'm particularly worried that
if we design the required API in a rush, we're not going to get it right,
and will have to change it again soon. That might be difficult in a minor
release. Using pg_read_file() and friends is quite flexible, even though we
just find out that they're not quite flexible enough right now (the ENOENT
problem).

I agree that it's too late to do what I said...

But just using pg_read_file() cannot address the #2 problem that I pointed
in my previous email. Also requiring a superuser privilege on pg_rewind
really conflicts with the motivation why we added replication privilege.

So we should change pg_read_file() so that even replication user can
read the file?

From the security perspective, a replication user can take a base
backup, so it can already easily retrieve the contents of PGDATA. Hence
I guess that it would be fine. However, what about cases where
pg_hba.conf authorizes access for a given replication user via psql and
blocks it for the replication protocol? We could say that the OP should
not give out replication access that easily, but in this case the user
would have access to the contents of PGDATA even though he should not.
Is that unrealistic?

Or replication user version of pg_read_file() should be implemented?

You mean a new function? How would that differ from authorizing
pg_read_file() usage for a replication user?

Honestly, I can live with this superuser restriction in 9.5. And come
back to the replication user restriction in 9.6 once things cool down
a bit.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#18)
Re: pg_rewind failure by file deletion in source server

On Wed, Jun 24, 2015 at 3:36 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Jun 24, 2015 at 1:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Jun 23, 2015 at 11:21 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/23/2015 05:03 PM, Fujii Masao wrote:

On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

On 06/23/2015 07:51 AM, Michael Paquier wrote:

So... Attached are a set of patches dedicated at fixing this issue:

Thanks for working on this!

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With these patches, pg_read_file() will return NULL for any failure to
open
the file, which makes pg_rewind to assume that the file doesn't exist in
the
source server, and will remove the file from the destination. That's
dangerous, those functions should check specifically for ENOENT.

I'm wondering if using pg_read_file() to copy the file from source server
is reasonable. ISTM that it has two problems as follows.

1. It cannot read very large file like 1GB file. So if such large file was
created in source server after failover, pg_rewind would not be able
to copy the file. No?

pg_read_binary_file() handles large files just fine. It cannot return more
than 1GB in one call, but you can call it several times and retrieve the
file in chunks. That's what pg_rewind does, except for reading the control
file, which is known to be small.

Yeah, you're right.

I found that pg_rewind creates a temporary table to fetch the file in chunks.
This would prevent pg_rewind from using the *hot standby* server as a source
server at all. This is of course a limitation of pg_rewind, but we might want
to alleviate it in the future.

This is something that a replication command could address properly.

2. Many users may not allow a remote client to connect to the
PostgreSQL server as a superuser for some security reasons. IOW,
there would be no entry in pg_hba.conf for such connection.
In this case, pg_rewind always fails because pg_read_file() needs
superuser privilege. No?

I'm tempted to implement the replication command version of
pg_read_file(). That is, it reads and sends the data like BASE_BACKUP
replication command does...

Yeah, that would definitely be nice. Peter suggested it back in January
(/messages/by-id/54AC4801.7050300@gmx.net). I think
it's way too late to do that for 9.5, however. I'm particularly worried that
if we design the required API in a rush, we're not going to get it right,
and will have to change it again soon. That might be difficult in a minor
release. Using pg_read_file() and friends is quite flexible, even though we
just find out that they're not quite flexible enough right now (the ENOENT
problem).

I agree that it's too late to do what I said...

But just using pg_read_file() cannot address the #2 problem that I pointed
in my previous email. Also requiring a superuser privilege on pg_rewind
really conflicts with the motivation why we added replication privilege.

So we should change pg_read_file() so that even replication user can
read the file?

From the security perspective, a replication user can take a base
backup so it can already retrieve easily the contents of PGDATA. Hence
I guess that it would be fine. However, what about cases where
pg_hba.conf authorizes access to a given replication user via psql and
blocks it for the replication protocol? We could say that OP should
not give out replication access that easily, but in this case the user
would have access to the content of PGDATA even if he should not. Is
that unrealistic?

The most realistic case is that both source and target servers have
the pg_hba.conf containing the following authentication setting
regarding replication. That is, each server allows other to use the
replication user to connect to via replication protocol.

# TYPE DATABASE USER ADDRESS METHOD
host replication repuser X.X.X.X/Y md5

This case makes me think that allowing even a replication user to
call pg_read_file() may not be a good solution for us, because
in that case the replication user needs to log in to a real database
like "postgres" to call pg_read_file(), but usually there would be
no entry in pg_hba.conf allowing a replication user to connect to any
real database. So if we want to address this problem, a replication
command version of pg_read_file() would be required. However,
it's too late to do that now...

Or replication user version of pg_read_file() should be implemented?

You mean a new function? In what is it different from authorizing
pg_read_file usage for a replication user?

Honestly, I can live with this superuser restriction in 9.5. And come
back to the replication user restriction in 9.6 once things cool down
a bit.

Yeah, finally I agree with you. This seems the only approach we can
adopt for now.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#14)
Re: pg_rewind failure by file deletion in source server

On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/23/2015 07:51 AM, Michael Paquier wrote:

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With these patches, pg_read_file() will return NULL for any failure to open
the file, which makes pg_rewind to assume that the file doesn't exist in the
source server, and will remove the file from the destination. That's
dangerous, those functions should check specifically for ENOENT.

This makes sense. I changed all those functions to do so.

There's still a small race condition with tablespaces. If you run CREATE
TABLESPACE in the source server while pg_rewind is running, it's possible
that the recursive query that pg_rewind uses sees the symlink in pg_tblspc/
directory, but its snapshot doesn't see the row in pg_tablespace yet. It
will think that the symlink is a regular file, try to read it, and fail (if
we checked for ENOENT).
Actually, I think we need to try to deal with symlinks a bit harder.
Currently, pg_rewind assumes that anything in pg_tblspc/ that has a
matching row in pg_tablespace is a symlink, and nothing else is. But
other things can be symlinks to directories too, most notably pg_xlog.
I just noticed that pg_rewind fails miserably if pg_xlog is a symlink,
because of that:

----
The servers diverged at WAL position 0/3023F08 on timeline 1.
Rewinding from last common checkpoint at 0/2000060 on timeline 1

"data-master//pg_xlog" is not a directory
Failure, exiting
----

It may be possible that in this case the path is a symlink on the
source but not on the target, and vice versa, so this looks too
restrictive to me if we begin to use pg_readlink. Think for example of
a pg_xlog symlink that is not included in a base backup: we ignore
it in copy_fetch.c for pg_xlog and the others, so I think that here as
well we should ignore those errors except for tablespaces.

I think we need to add a column to pg_stat_file output, to indicate symbolic
links, and add a pg_readlink() function. That still leaves a race condition
if the type of a file changes, i.e. a file is deleted and a directory with
the same name is created in its place, but that seems acceptable. I don't
think PostgreSQL ever does such a thing, so that could only happen if you
mess with the data directory manually while the server is running.

Hm. pg_stat_file now uses stat(), not lstat(), so it cannot tell
whether something is a link or not. I have changed
pg_stat_file to use lstat() instead to cover this case in my set of
patches, but a new function may be a better answer here. I have added
as well a pg_readlink() function on the stack, and actually the
if_not_exists mode of pg_tablespace_location is not needed anymore if
we rely on pg_readlink() in this case. I have left it in the set of
patches though; it still looks useful for other utility tools.

I just realized another problem: We recently learned the hard way that some
people have files in the data directory that are not writeable by the
'postgres' user
(/messages/by-id/20150523172627.GA24277@msg.df7cb.de).
pg_rewind will try to overwrite all files it doesn't recognize as relation
files, so it's going to fail on those. A straightforward fix would be to
first open the destination file in read-only mode, and compare its contents,
and only open the file in write mode if it has changed. It would still fail
when the files really differ, but I think that's acceptable.

If I am not missing anything, two code paths need to be patched here:
copy_file_range and receiveFileChunks. copy_file_range is
straightforward. Now, wouldn't it be better to write the contents into
a temporary file, compare the contents, and then switch if necessary
for receiveFileChunks?
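The temporary-file idea is essentially the classic write-then-rename pattern; a rough local sketch (hypothetical helper, assuming the temporary file lands on the same filesystem so rename() can replace the target atomically):

```c
#include <stdio.h>

/*
 * Write "len" bytes to "<path>.tmp", then atomically replace <path>.
 * Returns 0 on success, -1 on failure (the target is left untouched).
 * rename() replacing an existing file atomically is POSIX behavior,
 * valid only within a single filesystem.
 */
static int
replace_file(const char *path, const char *data, size_t len)
{
	char		tmppath[1024];
	FILE	   *fp;

	snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
	fp = fopen(tmppath, "wb");
	if (fp == NULL)
		return -1;
	if (fwrite(data, 1, len, fp) != len)
	{
		fclose(fp);
		remove(tmppath);		/* clean up the partial temp file */
		return -1;
	}
	fclose(fp);
	return rename(tmppath, path);	/* atomic switch */
}
```

This keeps the destination file intact until the replacement is fully written, which is the property the compare-then-switch suggestion is after.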

I note that pg_rewind doesn't need to distinguish between an empty and a
non-existent directory, so it's quite silly for it to pass
include_dot_dirs=true, and then filter out "." and ".." from the result set.
The documentation should mention the main reason for including "." and "..":
to distinguish between an empty and non-existent directory.

OK. Switched to that in the first patch for pg_rewind.

Attached is a new set of patches. Except for the last one, which
addresses a pg_rewind issue (symlink management when streaming
PGDATA), they all introduce if_not_exists options for the
functions of genfile.c. The pg_rewind stuff could be more polished,
though. Feel free to comment.
Regards,
--
Michael

Attachments:

0001-Extend-pg_tablespace_location-with-if_not_exists-opt.patch
0002-Extend-pg_stat_file-with-if_not_exists-option.patch
0003-Add-IF-NOT-EXISTS-to-pg_read_file-and-pg_read_binary.patch
0004-Extend-pg_ls_dir-with-include_dot_dirs-and-if_not_ex.patch
0005-Add-new-column-islink-in-pg_stat_file.patch
0006-Add-pg_readlink-to-get-value-of-a-symbolic-link.patch
0007-Add-regression-tests-for-pg_ls_dir-and-pg_read_-bina.patch
0008-Fix-symlink-usage-in-pg_rewind.patch
#21Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#20)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#21)
#23Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#22)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#23)
#25Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#20)
#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#8)
#27Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#25)
#28Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#26)
#29Vladimir Borodin
root@simply.name
In reply to: Heikki Linnakangas (#25)
#30Michael Paquier
michael@paquier.xyz
In reply to: Vladimir Borodin (#29)
#31Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#28)
#32Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#31)
#33Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#31)
#34Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#33)
#35Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#34)
#36Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#35)
#37Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#36)
#38Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#34)