pg_rewind failure by file deletion in source server

Started by Fujii Masaoover 10 years ago38 messages
#1Fujii Masao
masao.fujii@gmail.com

Hi,

While testing pg_rewind, I got the following error and pg_rewind failed.

$ pg_rewind -D ... --source-server="..." -P
ERROR: could not open file "base/13243/16384" for reading: No
such file or directory
STATEMENT: SELECT path, begin,
pg_read_binary_file(path, begin, len) AS chunk
FROM fetchchunks

As far as I read the pg_rewind code, ISTM that the file deletion in
source server while pg_rewind is running can cause pg_rewind to fail.
That is, at first pg_rewind picks up the files to copy (or do some actions)
and creates the file map. Then it performs the actual operation (e.g.,
file copy from source to dest) according to the file map. The problem
can happen if the source server deletes the file listed in the file map
before pg_rewind performs the actual operations. The copy of the file
must fail because it's not found in source server, and then pg_rewind
exits with an error.

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#1)
Re: pg_rewind failure by file deletion in source server

On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?

I think that you should fail. Let's imagine that the master to be
rewound has removed a relation file before being stopped cleanly after
its standby has been promoted that was here at the last checkpoint
before forking, and that the standby still has the relation file after
promotion. You should be able to copy it to be able to replay WAL on
it. If the standby has removed a file in the file map after taking the
file map, I guess that the best thing to do is fail because the file
that should be here for the rewound node cannot be fetched.
Documentation should be made clearer about that with a better error
message...
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#2)
Re: pg_rewind failure by file deletion in source server

On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?

I think that you should fail. Let's imagine that the master to be
rewound has removed a relation file before being stopped cleanly after
its standby has been promoted that was here at the last checkpoint
before forking, and that the standby still has the relation file after
promotion. You should be able to copy it to be able to replay WAL on
it. If the standby has removed a file in the file map after taking the
file map, I guess that the best thing to do is fail because the file
that should be here for the rewound node cannot be fetched.

In this case, why do you think that the file should exist in the old master?
Even if it doesn't exist, ISTM that the old master can safely replay the WAL
records related to the file when it restarts. So what's the problem
if the file doesn't exist in the old master?

Documentation should be made clearer about that with a better error
message...

I'm wondering how we can recover (or rewind again) the old master from
that error. This also would need to be documented if we decide not to
fix any code regarding this problem...

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#3)
Re: pg_rewind failure by file deletion in source server

On Thu, Jun 11, 2015 at 5:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?

I think that you should fail. Let's imagine that the master to be
rewound has removed a relation file before being stopped cleanly after
its standby has been promoted that was here at the last checkpoint
before forking, and that the standby still has the relation file after
promotion. You should be able to copy it to be able to replay WAL on
it. If the standby has removed a file in the file map after taking the
file map, I guess that the best thing to do is fail because the file
that should be here for the rewound node cannot be fetched.

In this case, why do you think that the file should exist in the old master?
Even if it doesn't exist, ISTM that the old master can safely replay the WAL
records related to the file when it restarts. So what's the problem
if the file doesn't exist in the old master?

Well, some user may want to rewind the master down to the point where
WAL forked, and then recover it immediately when a consistent point is
reached just at restart instead of replugging it into the cluster. In
this case I think that you need the relation file of the dropped
relation to get a consistent state. That's still cheaper than
recreating a node from a fresh base backup in some cases, particularly
if the last base backup taken is far in the past for this cluster.

Documentation should be made clearer about that with a better error
message...

I'm wondering how we can recover (or rewind again) the old master from
that error. This also would need to be documented if we decide not to
fix any code regarding this problem...

FWIW, here is a scenario able to trigger the error with 1 master (port
5432, data at ~/data/5432) and 1 standby (port 5433, data at
~/data/5433).
$ psql -c 'create table aa as select generate_series(1,1000000)'
# Promote standby
$ pg_ctl promote -D ~/data/5433/
# Drop table on master
$ psql -c 'drop table aa'
DROP TABLE
$ pg_ctl stop -D ~/data/5432/

At this point there is no more relation file on master for 'aa', it is
still present on standby. Running pg_rewind at this point will work,
the relation file would be copied from the promoted standby to master.

$ lldb -- pg_rewind -D 5432 --source-server="port=5433 dbname=postgres"
Breakpoint pg_rewind after fetchSourceFileList() and before replaying
the changes from the block map, drop table 'aa' on standby and
checkpoint it, then the source file list is inconsistent and pg_rewind
will fail. This can just happen with --source-server, with
--source-pgdata

Adding a sleep() of a couple of seconds in pg_rewind may be better to
trigger directly the error ;), with DROP DATABASE for example.

Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#4)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 12, 2015 at 3:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 5:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?

I think that you should fail. Let's imagine that the master to be
rewound has removed a relation file before being stopped cleanly after
its standby has been promoted that was here at the last checkpoint
before forking, and that the standby still has the relation file after
promotion. You should be able to copy it to be able to replay WAL on
it. If the standby has removed a file in the file map after taking the
file map, I guess that the best thing to do is fail because the file
that should be here for the rewound node cannot be fetched.

In this case, why do you think that the file should exist in the old master?
Even if it doesn't exist, ISTM that the old master can safely replay the WAL
records related to the file when it restarts. So what's the problem
if the file doesn't exist in the old master?

Well, some user may want to rewind the master down to the point where
WAL forked, and then recover it immediately when a consistent point is
reached just at restart instead of replugging it into the cluster. In
this case I think that you need the relation file of the dropped
relation to get a consistent state. That's still cheaper than
recreating a node from a fresh base backup in some cases, particularly
if the last base backup taken is far in the past for this cluster.

So it's the case where a user wants to recover old master up to the point
BEFORE the file in question is deleted in new master. At that point,
since the file must exist, pg_rewind should fail if the file cannot be copied
from new master. Is my understanding right?

As far as I read the code of pg_rewind, ISTM that your scenario never happens.
Because pg_rewind sets the minimum recovery point to the latest WAL location
in new master, i.e., AFTER the file is deleted. So old master cannot stop
recovering before the file is deleted in new master. If the recovery stops
at that point, it fails because the minimum recovery point is not reached yet.

IOW, after pg_rewind runs, the old master has to replay the WAL records
which were generated by the deletion of the file in the new master.
So it's okay if the old master doesn't have the file after pg_rewind runs,
I think.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#5)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 12, 2015 at 3:50 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Jun 12, 2015 at 3:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 5:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?

I think that you should fail. Let's imagine that the master to be
rewound has removed a relation file before being stopped cleanly after
its standby has been promoted that was here at the last checkpoint
before forking, and that the standby still has the relation file after
promotion. You should be able to copy it to be able to replay WAL on
it. If the standby has removed a file in the file map after taking the
file map, I guess that the best thing to do is fail because the file
that should be here for the rewound node cannot be fetched.

In this case, why do you think that the file should exist in the old master?
Even if it doesn't exist, ISTM that the old master can safely replay the WAL
records related to the file when it restarts. So what's the problem
if the file doesn't exist in the old master?

Well, some user may want to rewind the master down to the point where
WAL forked, and then recover it immediately when a consistent point is
reached just at restart instead of replugging it into the cluster. In
this case I think that you need the relation file of the dropped
relation to get a consistent state. That's still cheaper than
recreating a node from a fresh base backup in some cases, particularly
if the last base backup taken is far in the past for this cluster.

So it's the case where a user wants to recover old master up to the point
BEFORE the file in question is deleted in new master. At that point,
since the file must exist, pg_rewind should fail if the file cannot be copied
from new master. Is my understanding right?

Yep. We are on the same line.

As far as I read the code of pg_rewind, ISTM that your scenario never happens.
Because pg_rewind sets the minimum recovery point to the latest WAL location
in new master, i.e., AFTER the file is deleted. So old master cannot stop
recovering before the file is deleted in new master. If the recovery stops
at that point, it fails because the minimum recovery point is not reached yet.

IOW, after pg_rewind runs, the old master has to replay the WAL records
which were generated by the deletion of the file in the new master.
So it's okay if the old master doesn't have the file after pg_rewind runs,
I think.

Ah, right. I withdraw, indeed what I thought can not happen:
/*
* Update control file of target. Make it ready to perform archive
* recovery when restarting.
*
* minRecoveryPoint is set to the current WAL insert location in the
* source server. Like in an online backup, it's important
that we recover
* all the WAL that was generated while we copied the files over.
*/
So a rewound node will replay WAL up to the current insert location of
the source, and will fail at recovery if recovery target is older than
this insert location..

You want to draft a patch? Should I? I think that we should have a
test case as well in pg_rewind/t/.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#6)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 12, 2015 at 4:29 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Jun 12, 2015 at 3:50 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Jun 12, 2015 at 3:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 5:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Jun 11, 2015 at 2:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?

I think that you should fail. Let's imagine that the master to be
rewound has removed a relation file before being stopped cleanly after
its standby has been promoted that was here at the last checkpoint
before forking, and that the standby still has the relation file after
promotion. You should be able to copy it to be able to replay WAL on
it. If the standby has removed a file in the file map after taking the
file map, I guess that the best thing to do is fail because the file
that should be here for the rewound node cannot be fetched.

In this case, why do you think that the file should exist in the old master?
Even if it doesn't exist, ISTM that the old master can safely replay the WAL
records related to the file when it restarts. So what's the problem
if the file doesn't exist in the old master?

Well, some user may want to rewind the master down to the point where
WAL forked, and then recover it immediately when a consistent point is
reached just at restart instead of replugging it into the cluster. In
this case I think that you need the relation file of the dropped
relation to get a consistent state. That's still cheaper than
recreating a node from a fresh base backup in some cases, particularly
if the last base backup taken is far in the past for this cluster.

So it's the case where a user wants to recover old master up to the point
BEFORE the file in question is deleted in new master. At that point,
since the file must exist, pg_rewind should fail if the file cannot be copied
from new master. Is my understanding right?

Yep. We are on the same line.

As far as I read the code of pg_rewind, ISTM that your scenario never happens.
Because pg_rewind sets the minimum recovery point to the latest WAL location
in new master, i.e., AFTER the file is deleted. So old master cannot stop
recovering before the file is deleted in new master. If the recovery stops
at that point, it fails because the minimum recovery point is not reached yet.

IOW, after pg_rewind runs, the old master has to replay the WAL records
which were generated by the deletion of the file in the new master.
So it's okay if the old master doesn't have the file after pg_rewind runs,
I think.

Ah, right. I withdraw, indeed what I thought can not happen:
/*
* Update control file of target. Make it ready to perform archive
* recovery when restarting.
*
* minRecoveryPoint is set to the current WAL insert location in the
* source server. Like in an online backup, it's important
that we recover
* all the WAL that was generated while we copied the files over.
*/
So a rewound node will replay WAL up to the current insert location of
the source, and will fail at recovery if recovery target is older than
this insert location..

You want to draft a patch? Should I?

Please feel free to try that! :)

I think that we should have a
test case as well in pg_rewind/t/.

Maybe.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#7)
2 attachment(s)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 12, 2015 at 9:02 PM, Fujii Masao wrote:

You want to draft a patch? Should I?

Please feel free to try that! :)

OK, so attached are a patch and a test case able to trigger easily the
error. Apply the patch and run the test case to reproduce the
following failure:
$ ERROR: could not open file "base/16384/16385_fsm" for reading: No
such file or directory
STATEMENT: SELECT path, begin,
pg_read_binary_file(path, begin, len) AS chunk
FROM fetchchunks
The patch adds a call to pg_usleep after the list of files from source
server has been fetched with libpq in pg_rewind.c to let time to run
some DROP actions, like DROP DATABASE, DROP TABLE, etc in order to
trigger the error easily.

In order to reduce the risk of failure to a minimum and to preserve
the performance of the tool when using --source-server, I think that
we should add some check using pg_stat_file to see if a file is still
present or not, and if it is missing we can safely skip it thanks to
minRecoveryPoint. Now the problem is that pg_stat_file fails
automatically if the file targeted is missing. Hence, to avoid a bunch
of round trips with the server with one call to pg_stat_dir per file,
I think that we should add some if_not_exists option to pg_stat_file,
defaulting to false, to skip the error related to the file missing and
have it return NULL in this case. Then we could use this filter on the
file path in libpq_executeFileMap() to fetch only the file chunks that
actually exist on the server. Note that we could as well use some
plpgsql-ing to do the same, but the extension of pg_stat_file looks
more useful to me. Thoughts?
--
Michael

Attachments:

rewind_test.bashapplication/octet-stream; name=rewind_test.bashDownload
20150616_pgrewind_sleep.patchbinary/octet-stream; name=20150616_pgrewind_sleep.patchDownload
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 7e54ac5..ea5385c 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -291,6 +291,12 @@ main(int argc, char **argv)
 		print_filemap();
 
 	/*
+	 * Sleep here to allow user to do nasty things with the source file
+	 * list.
+	 */
+	pg_usleep(10000000); /* 10s */
+
+	/*
 	 * Ok, we're ready to start copying things over.
 	 */
 	if (showprogress)
#9Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#8)
Re: pg_rewind failure by file deletion in source server

On 06/16/2015 02:04 AM, Michael Paquier wrote:

In order to reduce the risk of failure to a minimum and to preserve
the performance of the tool when using --source-server, I think that
we should add some check using pg_stat_file to see if a file is still
present or not, and if it is missing we can safely skip it thanks to
minRecoveryPoint. Now the problem is that pg_stat_file fails
automatically if the file targeted is missing. Hence, to avoid a bunch
of round trips with the server with one call to pg_stat_dir per file,
I think that we should add some if_not_exists option to pg_stat_file,
defaulting to false, to skip the error related to the file missing and
have it return NULL in this case. Then we could use this filter on the
file path in libpq_executeFileMap() to fetch only the file chunks that
actually exist on the server.

You'll also need to add the option to pg_read_binary_file, though,
because even if you do a test with pg_stat_file() just before reading
the file, there's a race condition: someone might still delete file
between pg_stat_file() and pg_read_file().

Listing the directories with pg_ls_dir() has the same problem. As does
pg_tablespace_location().

Note that we could as well use some plpgsql-ing to do the same, but
the extension of pg_stat_file looks more useful to me. Thoughts?

Hmm. You'll need to add the option to all of those functions. Maybe it's
nevertheless the simplest approach.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#9)
Re: pg_rewind failure by file deletion in source server

On Thu, Jun 18, 2015 at 10:55 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/16/2015 02:04 AM, Michael Paquier wrote:

In order to reduce the risk of failure to a minimum and to preserve
the performance of the tool when using --source-server, I think that
we should add some check using pg_stat_file to see if a file is still
present or not, and if it is missing we can safely skip it thanks to
minRecoveryPoint. Now the problem is that pg_stat_file fails
automatically if the file targeted is missing. Hence, to avoid a bunch
of round trips with the server with one call to pg_stat_dir per file,
I think that we should add some if_not_exists option to pg_stat_file,
defaulting to false, to skip the error related to the file missing and
have it return NULL in this case. Then we could use this filter on the
file path in libpq_executeFileMap() to fetch only the file chunks that
actually exist on the server.

You'll also need to add the option to pg_read_binary_file, though, because
even if you do a test with pg_stat_file() just before reading the file,
there's a race condition: someone might still delete file between
pg_stat_file() and pg_read_file().

I propose to return NULL values if the file does not exist and
if_not_exists = true for both of them. Does that sound fine?

Listing the directories with pg_ls_dir() has the same problem.

(After some discussion on IM with Heikki on this one).
This is actually more tricky because pg_ls_dir() does not return '.'
or '..' that we could use to identify that the directory actually
exists or not when it is empty. Hence I think that we should add two
options to pg_ls_dir:
- include_self, default to false. If set to true, '.' is added in the
list of items.
- if_not_exists, to bypass error that a folder does not exist, default
at false. If if_not_exists = true and include_self = true, returning
only '.' would mean that the folder exist but that it is empty. If
if_not_exists = true and include_self = false, no rows are returned.
We could as well ERROR as well if both options are set like that. I am
fine with any of them as long as behavior is properly documented.

As does pg_tablespace_location().

NULL if tablespace path does not exist anymore. Is that fine.

Note that we could as well use some plpgsql-ing to do the same, but
the extension of pg_stat_file looks more useful to me. Thoughts?

Hmm. You'll need to add the option to all of those functions. Maybe it's
nevertheless the simplest approach.

With plpgsql you could use a try/catch/raising block to do the work.
But it still looks better to me to have alternative options with the
in-core functions.

I am fine to spend time on all those things and provide test cases,
let's just get a precise picture of what we want first.
Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#10)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 19, 2015 at 12:14 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Listing the directories with pg_ls_dir() has the same problem.

(After some discussion on IM with Heikki on this one).
This is actually more tricky because pg_ls_dir() does not return '.'
or '..' that we could use to identify that the directory actually
exists or not when it is empty. Hence I think that we should add two
options to pg_ls_dir:
- include_self, default to false. If set to true, '.' is added in the
list of items.
- if_not_exists, to bypass error that a folder does not exist, default
at false. If if_not_exists = true and include_self = true, returning
only '.' would mean that the folder exist but that it is empty. If
if_not_exists = true and include_self = false, no rows are returned.
We could as well ERROR as well if both options are set like that. I am
fine with any of them as long as behavior is properly documented.

Including '.' to distinguish between an empty directory and a
nonexistent one seems like an unnecessarily complicated and
non-obvious API. How about just one additional parameter bool
*exists. If NULL and no directory, ERROR, else on return set *exists
to true or false.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#11)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 19, 2015 at 8:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 19, 2015 at 12:14 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Listing the directories with pg_ls_dir() has the same problem.

(After some discussion on IM with Heikki on this one).
This is actually more tricky because pg_ls_dir() does not return '.'
or '..' that we could use to identify that the directory actually
exists or not when it is empty. Hence I think that we should add two
options to pg_ls_dir:
- include_self, default to false. If set to true, '.' is added in the
list of items.
- if_not_exists, to bypass error that a folder does not exist, default
at false. If if_not_exists = true and include_self = true, returning
only '.' would mean that the folder exist but that it is empty. If
if_not_exists = true and include_self = false, no rows are returned.
We could as well ERROR as well if both options are set like that. I am
fine with any of them as long as behavior is properly documented.

Including '.' to distinguish between an empty directory and a
nonexistent one seems like an unnecessarily complicated and
non-obvious API. How about just one additional parameter bool
*exists. If NULL and no directory, ERROR, else on return set *exists
to true or false.

Err, wait. You're talking about an SQL function, heh heh. So that
won't work. Maybe what you proposed is the best we can do, then,
although I would suggest that you rename the include_self parameter to
something like include_dot_dirs and return both "." and "..".
Returning only "." seems like it will seem weird to people.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#12)
8 attachment(s)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 19, 2015 at 9:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 19, 2015 at 8:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jun 19, 2015 at 12:14 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Listing the directories with pg_ls_dir() has the same problem.

(After some discussion on IM with Heikki on this one).
This is actually more tricky because pg_ls_dir() does not return '.'
or '..' that we could use to identify that the directory actually
exists or not when it is empty. Hence I think that we should add two
options to pg_ls_dir:
- include_self, default to false. If set to true, '.' is added in the
list of items.
- if_not_exists, to bypass error that a folder does not exist, default
at false. If if_not_exists = true and include_self = true, returning
only '.' would mean that the folder exist but that it is empty. If
if_not_exists = true and include_self = false, no rows are returned.
We could as well ERROR as well if both options are set like that. I am
fine with any of them as long as behavior is properly documented.

Including '.' to distinguish between an empty directory and a
nonexistent one seems like an unnecessarily complicated and
non-obvious API. How about just one additional parameter bool
*exists. If NULL and no directory, ERROR, else on return set *exists
to true or false.

Err, wait. You're talking about an SQL function, heh heh. So that
won't work. Maybe what you proposed is the best we can do, then,
although I would suggest that you rename the include_self parameter to
something like include_dot_dirs and return both "." and "..".
Returning only "." seems like it will seem weird to people.

So... Attached are a set of patches dedicated at fixing this issue:
- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

Attached is an updated test case triggering the issue
(rewind_test.bash), with the small patch attached that adds a pg_sleep
call in pg_rewind.c (20150623_pg_rewind_sleep.patch).

I imagine that this is a bug people are going to meet in the field
easily, particularly with temporary relation files or temporary XLOG
files.
Regards,
--
Michael

Attachments:

0006-Make-pg_rewind-able-to-detect-deleted-files-on-remot.patchapplication/x-patch; name=0006-Make-pg_rewind-able-to-detect-deleted-files-on-remot.patchDownload
From f12a2cec0489d6c20ed55665bc6698ac701e5d73 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Tue, 23 Jun 2015 13:16:23 +0900
Subject: [PATCH 6/6] Make pg_rewind able to detect deleted files on remote
 source

When getting a list of PGDATA files from a remote server with pg_rewind,
it is possible that this list contains temporary files or files that have
been deleted on the remote server after taking the file list, like what
DROP TABLE would do for a relation. This can lead to errors when calling
pg_read_binary_file on an entry that is listed but does not exist anymore,
preventing the rewind from working correctly.

In order to prevent such problems, use the new if_not_exists options
in pg_ls_dir, pg_tablespace_location and pg_read_binary_file to not
fail if a file does not exist anymore and continue process as for
example a relation file removed does not prevent a rewound node from
replaying WAL in recovery, node rewound guaranteed to have a minimum
recovery point set to the current WAL location of the remote node that
is put in sync with.
---
 src/bin/pg_rewind/libpq_fetch.c | 41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 6ffd24e..2051d30 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -149,18 +149,24 @@ libpqProcessFileList(void)
 	sql =
 		"WITH RECURSIVE files (path, filename, size, isdir) AS (\n"
 		"  SELECT '' AS path, filename, size, isdir FROM\n"
-		"  (SELECT pg_ls_dir('.') AS filename) AS fn,\n"
-		"        pg_stat_file(fn.filename) AS this\n"
+		"    (SELECT filename\n"
+		"         FROM pg_ls_dir('.', true, true) AS filename\n"
+		"         WHERE filename NOT IN ('.', '..')) AS fn,\n"
+		"    pg_stat_file(fn.filename, true) AS this\n"
+		"    WHERE this.size IS NOT NULL\n"
 		"  UNION ALL\n"
 		"  SELECT parent.path || parent.filename || '/' AS path,\n"
 		"         fn, this.size, this.isdir\n"
 		"  FROM files AS parent,\n"
-		"       pg_ls_dir(parent.path || parent.filename) AS fn,\n"
-		"       pg_stat_file(parent.path || parent.filename || '/' || fn) AS this\n"
-		"       WHERE parent.isdir = 't'\n"
-		")\n"
-		"SELECT path || filename, size, isdir,\n"
-		"       pg_tablespace_location(pg_tablespace.oid) AS link_target\n"
+		"       pg_ls_dir(parent.path || parent.filename, true, true) AS fn,\n"
+		"       pg_stat_file(parent.path || parent.filename || '/' || fn, true)\n"
+		"              AS this\n"
+		"       WHERE parent.isdir = 't' AND\n"
+		"             (fn NOT IN ('.', '..')) AND\n"
+		"	     this.size IS NOT NULL)\n"
+		"SELECT path || filename AS file_path, size, isdir,\n"
+		"       pg_tablespace_location(pg_tablespace.oid, true)\n"
+		"              AS link_target\n"
 		"FROM files\n"
 		"LEFT OUTER JOIN pg_tablespace ON files.path = 'pg_tblspc/'\n"
 		"                             AND oid::text = files.filename\n";
@@ -259,8 +265,7 @@ receiveFileChunks(const char *sql)
 		}
 
 		if (PQgetisnull(res, 0, 0) ||
-			PQgetisnull(res, 0, 1) ||
-			PQgetisnull(res, 0, 2))
+			PQgetisnull(res, 0, 1))
 		{
 			pg_fatal("unexpected NULL result while fetching remote files\n");
 		}
@@ -278,6 +283,20 @@ receiveFileChunks(const char *sql)
 		memcpy(filename, PQgetvalue(res, 0, 0), filenamelen);
 		filename[filenamelen] = '\0';
 
+		/*
+		 * It may be possible that a file has been deleted on remote side after
+		 * creating the file map. In this case simply ignore it and move on.
+		 */
+		if (PQgetisnull(res, 0, 2))
+		{
+			pg_log(PG_DEBUG,
+				   "received NULL chunk for file \"%s\", file has been deleted\n",
+				   filename);
+			pg_free(filename);
+			PQclear(res);
+			continue;
+		}
+
 		chunk = PQgetvalue(res, 0, 2);
 
 		pg_log(PG_DEBUG, "received chunk for file \"%s\", off %d, len %d\n",
@@ -445,7 +464,7 @@ libpq_executeFileMap(filemap_t *map)
 	 */
 	sql =
 		"SELECT path, begin, \n"
-		"  pg_read_binary_file(path, begin, len) AS chunk\n"
+		"  pg_read_binary_file(path, begin, len, true) AS chunk\n"
 		"FROM fetchchunks\n";
 
 	receiveFileChunks(sql);
-- 
2.4.4

0001-Extend-pg_tablespace_location-with-if_not_exists-opt.patchapplication/x-patch; name=0001-Extend-pg_tablespace_location-with-if_not_exists-opt.patchDownload
From 454fa9e67c67621b0832d7fbd90153328c99197b Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Tue, 23 Jun 2015 11:45:49 +0900
Subject: [PATCH 1/6] Extend pg_tablespace_location with if_not_exists option

This is useful for code paths that prefer receive an empty response
instead of an ERROR if tablespace specified does not exist.
---
 doc/src/sgml/func.sgml        |  8 ++++--
 src/backend/utils/adt/misc.c  | 60 ++++++++++++++++++++++++++++++++++---------
 src/include/catalog/pg_proc.h |  2 ++
 src/include/utils/builtins.h  |  1 +
 4 files changed, 57 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 650051b..7929e7e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15844,9 +15844,13 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);
        <entry>get the set of database OIDs that have objects in the tablespace</entry>
       </row>
       <row>
-       <entry><literal><function>pg_tablespace_location(<parameter>tablespace_oid</parameter>)</function></literal></entry>
+       <entry><literal><function>pg_tablespace_location(<parameter>tablespace_oid</parameter> [, <parameter>if_not_exists</> <type>boolean</type>])</function></literal></entry>
        <entry><type>text</type></entry>
-       <entry>get the path in the file system that this tablespace is located in</entry>
+       <entry>
+        Get the path in the file system that this tablespace is located in.
+        If <parameter>if_not_exists</parameter> is true, <literal>NULL</literal>
+        is returned if tablespace does not exist instead of an error.
+       </entry>
       </row>
       <row>
        <entry><literal><function>pg_typeof(<parameter>any</parameter>)</function></literal></entry>
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index c0495d9..034728d 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -336,17 +336,24 @@ pg_tablespace_databases(PG_FUNCTION_ARGS)
 	SRF_RETURN_DONE(funcctx);
 }
 
-
 /*
- * pg_tablespace_location - get location for a tablespace
+ * Wrapper function for pg_tablespace_location and
+ * pg_tablespace_location_extended.
  */
-Datum
-pg_tablespace_location(PG_FUNCTION_ARGS)
+static Datum
+tablespace_location_wrapper(FunctionCallInfo fcinfo)
 {
 	Oid			tablespaceOid = PG_GETARG_OID(0);
 	char		sourcepath[MAXPGPATH];
 	char		targetpath[MAXPGPATH];
 	int			rllen;
+	bool		if_not_exists = false;
+
+	/*
+	 * Check for IF NOT EXISTS option.
+	 */
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+		if_not_exists = PG_GETARG_BOOL(1);
 
 	/*
 	 * It's useful to apply this function to pg_class.reltablespace, wherein
@@ -373,15 +380,25 @@ pg_tablespace_location(PG_FUNCTION_ARGS)
 
 	rllen = readlink(sourcepath, targetpath, sizeof(targetpath));
 	if (rllen < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read symbolic link \"%s\": %m",
-						sourcepath)));
+	{
+		if (!if_not_exists)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read symbolic link \"%s\": %m",
+							sourcepath)));
+		else
+			PG_RETURN_NULL();
+	}
 	if (rllen >= sizeof(targetpath))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("symbolic link \"%s\" target is too long",
-						sourcepath)));
+	{
+		if (!if_not_exists)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("symbolic link \"%s\" target is too long",
+							sourcepath)));
+		else
+			PG_RETURN_NULL();
+	}
 	targetpath[rllen] = '\0';
 
 	PG_RETURN_TEXT_P(cstring_to_text(targetpath));
@@ -394,6 +411,25 @@ pg_tablespace_location(PG_FUNCTION_ARGS)
 }
 
 /*
+ * pg_tablespace_location - get location for a tablespace
+ */
+Datum
+pg_tablespace_location(PG_FUNCTION_ARGS)
+{
+	return tablespace_location_wrapper(fcinfo);
+}
+
+/*
+ * pg_tablespace_location - get location for a tablespace, with option to
+ * bypass errors in case of a non-existing tablespace.
+ */
+Datum
+pg_tablespace_location_extended(PG_FUNCTION_ARGS)
+{
+	return tablespace_location_wrapper(fcinfo);
+}
+
+/*
  * pg_sleep - delay for N seconds
  */
 Datum
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 6b3d194..1f163f0 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2922,6 +2922,8 @@ DESCR("current trigger depth");
 
 DATA(insert OID = 3778 ( pg_tablespace_location PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 25 "26" _null_ _null_ _null_ _null_ _null_ pg_tablespace_location _null_ _null_ _null_ ));
 DESCR("tablespace location");
+DATA(insert OID = 3579 ( pg_tablespace_location PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 25 "26 16" _null_ _null_ _null_ _null_ _null_ pg_tablespace_location_extended _null_ _null_ _null_ ));
+DESCR("tablespace location");
 
 DATA(insert OID = 1946 (  encode						PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 25 "17 25" _null_ _null_ _null_ _null_ _null_ binary_encode _null_ _null_ _null_ ));
 DESCR("convert bytea value into some ascii-only text string");
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 51f25a2..ab78ec2 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -486,6 +486,7 @@ extern Datum pg_terminate_backend(PG_FUNCTION_ARGS);
 extern Datum pg_reload_conf(PG_FUNCTION_ARGS);
 extern Datum pg_tablespace_databases(PG_FUNCTION_ARGS);
 extern Datum pg_tablespace_location(PG_FUNCTION_ARGS);
+extern Datum pg_tablespace_location_extended(PG_FUNCTION_ARGS);
 extern Datum pg_rotate_logfile(PG_FUNCTION_ARGS);
 extern Datum pg_sleep(PG_FUNCTION_ARGS);
 extern Datum pg_get_keywords(PG_FUNCTION_ARGS);
-- 
2.4.4

0002-Extend-pg_stat_file-with-if_not_exists-option.patchapplication/x-patch; name=0002-Extend-pg_stat_file-with-if_not_exists-option.patchDownload
From 72ef91958e91e9d6ba4a72d325872edb65166cc7 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Tue, 23 Jun 2015 11:48:34 +0900
Subject: [PATCH 2/6] Extend pg_stat_file with if_not_exists option

This is useful for code paths that prefer receiving a NULL result
instead of an ERROR if file specified does not exist.
---
 doc/src/sgml/func.sgml          |  9 ++++--
 src/backend/utils/adt/genfile.c | 65 +++++++++++++++++++++++++++++++----------
 src/include/catalog/pg_proc.h   |  2 ++
 src/include/utils/builtins.h    |  1 +
 4 files changed, 59 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7929e7e..0f70985 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17830,10 +17830,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
       </row>
       <row>
        <entry>
-        <literal><function>pg_stat_file(<parameter>filename</> <type>text</>)</function></literal>
+        <literal><function>pg_stat_file(<parameter>filename</> <type>text</>[, <parameter>if_not_exists</> <type>boolean</type>])</function></literal>
        </entry>
        <entry><type>record</type></entry>
-       <entry>Return information about a file</entry>
+       <entry>
+        Return information about a file. If
+        <parameter>if_not_exists</parameter> is true, <literal>NULL</literal>
+        is returned for all result fields if file does not exist instead of
+        an error.
+       </entry>
       </row>
      </tbody>
     </tgroup>
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index f3f3cca..93eb379 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -254,10 +254,10 @@ pg_read_binary_file_all(PG_FUNCTION_ARGS)
 }
 
 /*
- * stat a file
+ * Wrapper for pg_stat_file and pg_stat_file_extended.
  */
-Datum
-pg_stat_file(PG_FUNCTION_ARGS)
+static Datum
+stat_file_wrapper(PG_FUNCTION_ARGS)
 {
 	text	   *filename_t = PG_GETARG_TEXT_P(0);
 	char	   *filename;
@@ -266,18 +266,29 @@ pg_stat_file(PG_FUNCTION_ARGS)
 	bool		isnull[6];
 	HeapTuple	tuple;
 	TupleDesc	tupdesc;
+	bool		if_not_exists = false;
+	bool		return_null = false;
 
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 (errmsg("must be superuser to get file information"))));
 
+	/* Check for option if_not_exists */
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+		if_not_exists = PG_GETARG_BOOL(1);
+
 	filename = convert_and_check_filename(filename_t);
 
 	if (stat(filename, &fst) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not stat file \"%s\": %m", filename)));
+	{
+		if (if_not_exists)
+			return_null = true;
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not stat file \"%s\": %m", filename)));
+	}
 
 	/*
 	 * This record type had better match the output parameters declared for me
@@ -298,20 +309,23 @@ pg_stat_file(PG_FUNCTION_ARGS)
 					   "isdir", BOOLOID, -1, 0);
 	BlessTupleDesc(tupdesc);
 
-	memset(isnull, false, sizeof(isnull));
+	memset(isnull, return_null, sizeof(isnull));
 
-	values[0] = Int64GetDatum((int64) fst.st_size);
-	values[1] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_atime));
-	values[2] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_mtime));
-	/* Unix has file status change time, while Win32 has creation time */
+	if (!return_null)
+	{
+		values[0] = Int64GetDatum((int64) fst.st_size);
+		values[1] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_atime));
+		values[2] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_mtime));
+		/* Unix has file status change time, while Win32 has creation time */
 #if !defined(WIN32) && !defined(__CYGWIN__)
-	values[3] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_ctime));
-	isnull[4] = true;
+		values[3] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_ctime));
+		isnull[4] = true;
 #else
-	isnull[3] = true;
-	values[4] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_ctime));
+		isnull[3] = true;
+		values[4] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_ctime));
 #endif
-	values[5] = BoolGetDatum(S_ISDIR(fst.st_mode));
+		values[5] = BoolGetDatum(S_ISDIR(fst.st_mode));
+	}
 
 	tuple = heap_form_tuple(tupdesc, values, isnull);
 
@@ -320,6 +334,25 @@ pg_stat_file(PG_FUNCTION_ARGS)
 	PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
 }
 
+/*
+ * pg_stat_file_extended
+ * Stat a file, with possibility to bypass error in case of a missing file.
+ */
+Datum
+pg_stat_file_extended(PG_FUNCTION_ARGS)
+{
+	return stat_file_wrapper(fcinfo);
+}
+
+/*
+ * stat a file
+ */
+Datum
+pg_stat_file(PG_FUNCTION_ARGS)
+{
+	return stat_file_wrapper(fcinfo);
+}
+
 
 /*
  * List a directory (returns the filenames only)
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 1f163f0..8da1781 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3185,6 +3185,8 @@ DESCR("rotate log file");
 
 DATA(insert OID = 2623 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2249 "25" "{25,20,1184,1184,1184,1184,16}" "{i,o,o,o,o,o,o}" "{filename,size,access,modification,change,creation,isdir}" _null_ _null_ pg_stat_file _null_ _null_ _null_ ));
 DESCR("get information about file");
+DATA(insert OID = 3307 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2249 "25 16" "{25,16,20,1184,1184,1184,1184,16}" "{i,i,o,o,o,o,o,o}" "{filename,if_not_exists,size,access,modification,change,creation,isdir}" _null_ _null_ pg_stat_file_extended _null_ _null_ _null_ ));
+DESCR("get information about file");
 DATA(insert OID = 2624 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 25 "25 20 20" _null_ _null_ _null_ _null_ _null_ pg_read_file _null_ _null_ _null_ ));
 DESCR("read text from a file");
 DATA(insert OID = 3826 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 25 "25" _null_ _null_ _null_ _null_ _null_ pg_read_file_all _null_ _null_ _null_ ));
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index ab78ec2..e46650a 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -472,6 +472,7 @@ extern Datum pg_relation_filepath(PG_FUNCTION_ARGS);
 extern bytea *read_binary_file(const char *filename,
 				 int64 seek_offset, int64 bytes_to_read);
 extern Datum pg_stat_file(PG_FUNCTION_ARGS);
+extern Datum pg_stat_file_extended(PG_FUNCTION_ARGS);
 extern Datum pg_read_file(PG_FUNCTION_ARGS);
 extern Datum pg_read_file_all(PG_FUNCTION_ARGS);
 extern Datum pg_read_binary_file(PG_FUNCTION_ARGS);
-- 
2.4.4

0003-Add-IF-NOT-EXISTS-to-pg_read_file-and-pg_read_binary.patchapplication/x-patch; name=0003-Add-IF-NOT-EXISTS-to-pg_read_file-and-pg_read_binary.patchDownload
From b4e067ccefef8aac912211aee792af91c624b553 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Tue, 23 Jun 2015 11:51:30 +0900
Subject: [PATCH 3/6] Add IF NOT EXISTS to pg_read_file and pg_read_binary_file

This is useful for code paths that do not want to fail if the file
queried does not exist.
---
 doc/src/sgml/func.sgml           |  16 +++-
 src/backend/commands/extension.c |   2 +-
 src/backend/utils/adt/genfile.c  | 174 +++++++++++++++++++++++++++++++++------
 src/include/catalog/pg_proc.h    |   8 ++
 src/include/utils/builtins.h     |   8 +-
 5 files changed, 179 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0f70985..edfea8b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17816,17 +17816,25 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
       </row>
       <row>
        <entry>
-        <literal><function>pg_read_file(<parameter>filename</> <type>text</> [, <parameter>offset</> <type>bigint</>, <parameter>length</> <type>bigint</>])</function></literal>
+        <literal><function>pg_read_file(<parameter>filename</> <type>text</> [, <parameter>offset</> <type>bigint</>, <parameter>length</> <type>bigint</> [, <parameter>if_not_exists</> <type>boolean</>] ])</function></literal>
        </entry>
        <entry><type>text</type></entry>
-       <entry>Return the contents of a text file</entry>
+       <entry>
+        Return the contents of a text file. If
+        <parameter>if_not_exists</parameter> is true, <literal>NULL</literal>
+        is returned if file does not exist instead of an error.
+       </entry>
       </row>
       <row>
        <entry>
-        <literal><function>pg_read_binary_file(<parameter>filename</> <type>text</> [, <parameter>offset</> <type>bigint</>, <parameter>length</> <type>bigint</>])</function></literal>
+        <literal><function>pg_read_binary_file(<parameter>filename</> <type>text</> [, <parameter>offset</> <type>bigint</>, <parameter>length</> <type>bigint</> [, <parameter>if_not_exists</> <type>boolean</>] ])</function></literal>
        </entry>
        <entry><type>bytea</type></entry>
-       <entry>Return the contents of a file</entry>
+       <entry>
+        Return the contents of a file. If
+        <parameter>if_not_exists</parameter> is true, <literal>NULL</literal>
+        is returned if file does not exist instead of an error.
+       </entry>
       </row>
       <row>
        <entry>
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 5cc74d0..25bbe01 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -640,7 +640,7 @@ read_extension_script_file(const ExtensionControlFile *control,
 	char	   *dest_str;
 	int			len;
 
-	content = read_binary_file(filename, 0, -1);
+	content = read_binary_file(filename, 0, -1, false);
 
 	/* use database encoding if not given */
 	if (control->encoding < 0)
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 93eb379..6ce9ec3 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -88,7 +88,8 @@ convert_and_check_filename(text *arg)
  * We read the whole of the file when bytes_to_read is negative.
  */
 bytea *
-read_binary_file(const char *filename, int64 seek_offset, int64 bytes_to_read)
+read_binary_file(const char *filename, int64 seek_offset,
+				 int64 bytes_to_read, bool if_not_exists)
 {
 	bytea	   *buf;
 	size_t		nbytes;
@@ -103,9 +104,14 @@ read_binary_file(const char *filename, int64 seek_offset, int64 bytes_to_read)
 			struct stat fst;
 
 			if (stat(filename, &fst) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not stat file \"%s\": %m", filename)));
+			{
+				if (if_not_exists)
+					return NULL;
+				else
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not stat file \"%s\": %m", filename)));
+			}
 
 			bytes_to_read = fst.st_size - seek_offset;
 		}
@@ -118,10 +124,15 @@ read_binary_file(const char *filename, int64 seek_offset, int64 bytes_to_read)
 				 errmsg("requested length too large")));
 
 	if ((file = AllocateFile(filename, PG_BINARY_R)) == NULL)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\" for reading: %m",
-						filename)));
+	{
+		if (if_not_exists)
+			return NULL;
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\" for reading: %m",
+							filename)));
+	}
 
 	if (fseeko(file, (off_t) seek_offset,
 			   (seek_offset >= 0) ? SEEK_SET : SEEK_END) != 0)
@@ -150,11 +161,16 @@ read_binary_file(const char *filename, int64 seek_offset, int64 bytes_to_read)
  * in the database encoding.
  */
 static text *
-read_text_file(const char *filename, int64 seek_offset, int64 bytes_to_read)
+read_text_file(const char *filename, int64 seek_offset,
+			   int64 bytes_to_read, bool if_not_exists)
 {
 	bytea	   *buf;
 
-	buf = read_binary_file(filename, seek_offset, bytes_to_read);
+	buf = read_binary_file(filename, seek_offset,
+						   bytes_to_read, if_not_exists);
+
+	if (buf == NULL)
+		return NULL;
 
 	/* Make sure the input is valid */
 	pg_verifymbstr(VARDATA(buf), VARSIZE(buf) - VARHDRSZ, false);
@@ -164,21 +180,27 @@ read_text_file(const char *filename, int64 seek_offset, int64 bytes_to_read)
 }
 
 /*
- * Read a section of a file, returning it as text
+ * Wrapper function for pg_read_file and pg_read_file_extended.
  */
-Datum
-pg_read_file(PG_FUNCTION_ARGS)
+static Datum
+read_file_wrapper(PG_FUNCTION_ARGS)
 {
 	text	   *filename_t = PG_GETARG_TEXT_P(0);
 	int64		seek_offset = PG_GETARG_INT64(1);
 	int64		bytes_to_read = PG_GETARG_INT64(2);
+	bool		if_not_exists = false;
 	char	   *filename;
+	bytea	   *result;
 
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 (errmsg("must be superuser to read files"))));
 
+	/* Check for option if_not_exists */
+	if (PG_NARGS() == 4 && !PG_ARGISNULL(3))
+		if_not_exists = PG_GETARG_BOOL(3);
+
 	filename = convert_and_check_filename(filename_t);
 
 	if (bytes_to_read < 0)
@@ -186,44 +208,100 @@ pg_read_file(PG_FUNCTION_ARGS)
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("requested length cannot be negative")));
 
-	PG_RETURN_TEXT_P(read_text_file(filename, seek_offset, bytes_to_read));
+	result = read_text_file(filename, seek_offset, bytes_to_read,
+							if_not_exists);
+	if (result == NULL)
+		PG_RETURN_NULL();
+	PG_RETURN_TEXT_P(result);
 }
 
 /*
- * Read the whole of a file, returning it as text
+ * Read a section of a file, returning it as text
  */
 Datum
-pg_read_file_all(PG_FUNCTION_ARGS)
+pg_read_file(PG_FUNCTION_ARGS)
+{
+	return read_file_wrapper(fcinfo);
+}
+
+/*
+ * Similar to pg_read_file, with IF NOT EXISTS option.
+ */
+Datum
+pg_read_file_extended(PG_FUNCTION_ARGS)
+{
+	return read_file_wrapper(fcinfo);
+}
+
+/*
+ * Wrapper function for pg_read_file_all and pg_read_file_all_extended.
+ */
+static Datum
+read_file_all_wrapper(PG_FUNCTION_ARGS)
 {
 	text	   *filename_t = PG_GETARG_TEXT_P(0);
 	char	   *filename;
+	text	   *result;
+	bool		if_not_exists = false;
 
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 (errmsg("must be superuser to read files"))));
 
+	/* Check for option if_not_exists */
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+		if_not_exists = PG_GETARG_BOOL(1);
+
 	filename = convert_and_check_filename(filename_t);
 
-	PG_RETURN_TEXT_P(read_text_file(filename, 0, -1));
+	result = read_text_file(filename, 0, -1, if_not_exists);
+	if (result == NULL)
+		PG_RETURN_NULL();
+
+	PG_RETURN_TEXT_P(result);
 }
 
 /*
- * Read a section of a file, returning it as bytea
+ * Read the whole of a file, returning it as text
  */
 Datum
-pg_read_binary_file(PG_FUNCTION_ARGS)
+pg_read_file_all(PG_FUNCTION_ARGS)
+{
+	return read_file_all_wrapper(fcinfo);
+}
+
+/*
+ * Similar to pg_read_file_all, with IF NOT EXISTS option.
+ */
+Datum
+pg_read_file_all_extended(PG_FUNCTION_ARGS)
+{
+	return read_file_all_wrapper(fcinfo);
+}
+
+/*
+ * Wrapper for pg_read_binary_file and pg_read_binary_file_extended.
+ */
+static Datum
+read_binary_file_wrapper(PG_FUNCTION_ARGS)
 {
 	text	   *filename_t = PG_GETARG_TEXT_P(0);
 	int64		seek_offset = PG_GETARG_INT64(1);
 	int64		bytes_to_read = PG_GETARG_INT64(2);
 	char	   *filename;
+	bool		if_not_exists = false;
+	bytea	   *result;
 
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 (errmsg("must be superuser to read files"))));
 
+	/* Check for option if_not_exists */
+	if (PG_NARGS() == 4 && !PG_ARGISNULL(3))
+		if_not_exists = PG_GETARG_BOOL(3);
+
 	filename = convert_and_check_filename(filename_t);
 
 	if (bytes_to_read < 0)
@@ -231,26 +309,76 @@ pg_read_binary_file(PG_FUNCTION_ARGS)
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("requested length cannot be negative")));
 
-	PG_RETURN_BYTEA_P(read_binary_file(filename, seek_offset, bytes_to_read));
+	result = read_binary_file(filename, seek_offset,
+							  bytes_to_read, if_not_exists);
+	if (result == NULL)
+		PG_RETURN_NULL();
+	PG_RETURN_BYTEA_P(result);
 }
 
 /*
- * Read the whole of a file, returning it as bytea
+ * Read a section of a file, returning it as bytea
  */
 Datum
-pg_read_binary_file_all(PG_FUNCTION_ARGS)
+pg_read_binary_file(PG_FUNCTION_ARGS)
+{
+	return read_binary_file_wrapper(fcinfo);
+}
+
+/*
+ * Similar to pg_read_binary_file, with IF NOT EXISTS option.
+ */
+Datum
+pg_read_binary_file_extended(PG_FUNCTION_ARGS)
+{
+	return read_binary_file_wrapper(fcinfo);
+}
+
+/*
+ * Wrapper for pg_read_binary_file_all and pg_read_binary_file_all_extended.
+ */
+static Datum
+read_binary_file_all_wrapper(PG_FUNCTION_ARGS)
 {
 	text	   *filename_t = PG_GETARG_TEXT_P(0);
 	char	   *filename;
+	bool		if_not_exists = false;
+	text	   *result;
 
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 (errmsg("must be superuser to read files"))));
 
+	/* Check for option if_not_exists */
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+		if_not_exists = PG_GETARG_BOOL(1);
+
 	filename = convert_and_check_filename(filename_t);
 
-	PG_RETURN_BYTEA_P(read_binary_file(filename, 0, -1));
+	result = read_binary_file(filename, 0, -1, if_not_exists);
+	if (result == NULL)
+		PG_RETURN_NULL();
+
+	PG_RETURN_BYTEA_P(result);
+}
+
+/*
+ * Read the whole of a file, returning it as bytea
+ */
+Datum
+pg_read_binary_file_all(PG_FUNCTION_ARGS)
+{
+	return read_binary_file_all_wrapper(fcinfo);
+}
+
+/*
+ * Similar to pg_read_binary_file_all, with IF NOT EXISTS option.
+ */
+Datum
+pg_read_binary_file_all_extended(PG_FUNCTION_ARGS)
+{
+	return read_binary_file_all_wrapper(fcinfo);
 }
 
 /*
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 8da1781..c805a9a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3189,12 +3189,20 @@ DATA(insert OID = 3307 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0
 DESCR("get information about file");
 DATA(insert OID = 2624 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 25 "25 20 20" _null_ _null_ _null_ _null_ _null_ pg_read_file _null_ _null_ _null_ ));
 DESCR("read text from a file");
+DATA(insert OID = 3293 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 25 "25 20 20 16" _null_ _null_ _null_ _null_ _null_ pg_read_file_extended _null_ _null_ _null_ ));
+DESCR("read text from a file");
 DATA(insert OID = 3826 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 25 "25" _null_ _null_ _null_ _null_ _null_ pg_read_file_all _null_ _null_ _null_ ));
 DESCR("read text from a file");
+DATA(insert OID = 3294 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 25 "25 16" _null_ _null_ _null_ _null_ _null_ pg_read_file_all_extended _null_ _null_ _null_ ));
+DESCR("read text from a file");
 DATA(insert OID = 3827 ( pg_read_binary_file	PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 17 "25 20 20" _null_ _null_ _null_ _null_ _null_ pg_read_binary_file _null_ _null_ _null_ ));
 DESCR("read bytea from a file");
+DATA(insert OID = 3295 ( pg_read_binary_file	PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 17 "25 20 20 16" _null_ _null_ _null_ _null_ _null_ pg_read_binary_file_extended _null_ _null_ _null_ ));
+DESCR("read bytea from a file");
 DATA(insert OID = 3828 ( pg_read_binary_file	PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 17 "25" _null_ _null_ _null_ _null_ _null_ pg_read_binary_file_all _null_ _null_ _null_ ));
 DESCR("read bytea from a file");
+DATA(insert OID = 3296 ( pg_read_binary_file	PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 17 "25 16" _null_ _null_ _null_ _null_ _null_ pg_read_binary_file_all_extended _null_ _null_ _null_ ));
+DESCR("read bytea from a file");
 DATA(insert OID = 2625 ( pg_ls_dir			PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 25 "25" _null_ _null_ _null_ _null_ _null_ pg_ls_dir _null_ _null_ _null_ ));
 DESCR("list all files in a directory");
 DATA(insert OID = 2626 ( pg_sleep			PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "701" _null_ _null_ _null_ _null_ _null_ pg_sleep _null_ _null_ _null_ ));
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index e46650a..82d0e18 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -470,13 +470,19 @@ extern Datum pg_relation_filepath(PG_FUNCTION_ARGS);
 
 /* genfile.c */
 extern bytea *read_binary_file(const char *filename,
-				 int64 seek_offset, int64 bytes_to_read);
+							   int64 seek_offset,
+							   int64 bytes_to_read,
+							   bool if_not_exists);
 extern Datum pg_stat_file(PG_FUNCTION_ARGS);
 extern Datum pg_stat_file_extended(PG_FUNCTION_ARGS);
 extern Datum pg_read_file(PG_FUNCTION_ARGS);
+extern Datum pg_read_file_extended(PG_FUNCTION_ARGS);
 extern Datum pg_read_file_all(PG_FUNCTION_ARGS);
+extern Datum pg_read_file_all_extended(PG_FUNCTION_ARGS);
 extern Datum pg_read_binary_file(PG_FUNCTION_ARGS);
+extern Datum pg_read_binary_file_extended(PG_FUNCTION_ARGS);
 extern Datum pg_read_binary_file_all(PG_FUNCTION_ARGS);
+extern Datum pg_read_binary_file_all_extended(PG_FUNCTION_ARGS);
 extern Datum pg_ls_dir(PG_FUNCTION_ARGS);
 
 /* misc.c */
-- 
2.4.4

0004-Extend-pg_ls_dir-with-include_dot_dirs-and-if_not_ex.patchapplication/x-patch; name=0004-Extend-pg_ls_dir-with-include_dot_dirs-and-if_not_ex.patchDownload
From f40a675db5e086a72e52cc9d0394728a0ada95f8 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Mon, 22 Jun 2015 21:19:33 +0900
Subject: [PATCH 4/6] Extend pg_ls_dir with include_dot_dirs and if_not_exists

If include_dot_dirs is set to true, "." and ".." are listed in the list
of files listed. If if_not_exists is true, an empty list of files is
returned to caller instead of erroring out if the folder specified does
not exist.
---
 doc/src/sgml/func.sgml          | 10 +++++--
 src/backend/utils/adt/genfile.c | 59 +++++++++++++++++++++++++++++++++--------
 src/include/catalog/pg_proc.h   |  2 ++
 src/include/utils/builtins.h    |  1 +
 4 files changed, 59 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index edfea8b..e9dc19f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17809,10 +17809,16 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
      <tbody>
       <row>
        <entry>
-        <literal><function>pg_ls_dir(<parameter>dirname</> <type>text</>)</function></literal>
+        <literal><function>pg_ls_dir(<parameter>dirname</> <type>text</> [, <parameter>if_not_exists</> <type>boolean</>, <parameter>include_dot_dirs</> <type>boolean</>] )</function></literal>
        </entry>
        <entry><type>setof text</type></entry>
-       <entry>List the contents of a directory</entry>
+       <entry>
+        List the contents of a directory.  If
+        <parameter>if_not_exists</parameter> is true, an empty result is
+        returned instead of an error. If <parameter>include_dot_dirs</>
+        is true, <quote>.</> and <quote>..</> are included as well in
+        the returned list.
+       </entry>
       </row>
       <row>
        <entry>
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 6ce9ec3..6af7253 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -35,6 +35,7 @@ typedef struct
 {
 	char	   *location;
 	DIR		   *dirdesc;
+	bool		include_dot_dirs;
 } directory_fctx;
 
 
@@ -483,14 +484,15 @@ pg_stat_file(PG_FUNCTION_ARGS)
 
 
 /*
- * List a directory (returns the filenames only)
+ * Wrapper for pg_ls_dir and pg_ls_dir_extended.
  */
-Datum
-pg_ls_dir(PG_FUNCTION_ARGS)
+static Datum
+ls_dir_wrapper(PG_FUNCTION_ARGS)
 {
 	FuncCallContext *funcctx;
 	struct dirent *de;
 	directory_fctx *fctx;
+	MemoryContext	oldcontext;
 
 	if (!superuser())
 		ereport(ERROR,
@@ -499,7 +501,17 @@ pg_ls_dir(PG_FUNCTION_ARGS)
 
 	if (SRF_IS_FIRSTCALL())
 	{
-		MemoryContext oldcontext;
+		bool			if_not_exists = false;
+		bool			include_dot_dirs = false;
+
+		/* see if custom parameters have been passed down */
+		if (PG_NARGS() == 3)
+		{
+			if (!PG_ARGISNULL(1))
+				if_not_exists = PG_GETARG_BOOL(1);
+			if (!PG_ARGISNULL(2))
+				include_dot_dirs = PG_GETARG_BOOL(2);
+		}
 
 		funcctx = SRF_FIRSTCALL_INIT();
 		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
@@ -508,13 +520,18 @@ pg_ls_dir(PG_FUNCTION_ARGS)
 		fctx->location = convert_and_check_filename(PG_GETARG_TEXT_P(0));
 
 		fctx->dirdesc = AllocateDir(fctx->location);
+		fctx->include_dot_dirs = include_dot_dirs;
 
 		if (!fctx->dirdesc)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not open directory \"%s\": %m",
-							fctx->location)));
-
+		{
+			if (if_not_exists && errno == ENOENT)
+				SRF_RETURN_DONE(funcctx);
+			else
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not open directory \"%s\": %m",
+								fctx->location)));
+		}
 		funcctx->user_fctx = fctx;
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -524,8 +541,9 @@ pg_ls_dir(PG_FUNCTION_ARGS)
 
 	while ((de = ReadDir(fctx->dirdesc, fctx->location)) != NULL)
 	{
-		if (strcmp(de->d_name, ".") == 0 ||
-			strcmp(de->d_name, "..") == 0)
+		if (!fctx->include_dot_dirs &&
+			(strcmp(de->d_name, ".") == 0 ||
+			 strcmp(de->d_name, "..") == 0))
 			continue;
 
 		SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(de->d_name));
@@ -535,3 +553,22 @@ pg_ls_dir(PG_FUNCTION_ARGS)
 
 	SRF_RETURN_DONE(funcctx);
 }
+
+/*
+ * List a directory (returns the filenames only)
+ */
+Datum
+pg_ls_dir(PG_FUNCTION_ARGS)
+{
+	return ls_dir_wrapper(fcinfo);
+}
+
+/*
+ * List a directory (can optionally return dot directories or
+ * use IF NOT EXISTS).
+ */
+Datum
+pg_ls_dir_extended(PG_FUNCTION_ARGS)
+{
+	return ls_dir_wrapper(fcinfo);
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index c805a9a..b559572 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3205,6 +3205,8 @@ DATA(insert OID = 3296 ( pg_read_binary_file	PGNSP PGUID 12 1 0 0 0 f f f f t f
 DESCR("read bytea from a file");
 DATA(insert OID = 2625 ( pg_ls_dir			PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 25 "25" _null_ _null_ _null_ _null_ _null_ pg_ls_dir _null_ _null_ _null_ ));
 DESCR("list all files in a directory");
+DATA(insert OID = 3297 ( pg_ls_dir			PGNSP PGUID 12 1 1000 0 0 f f f f t t v 3 0 25 "25 16 16" _null_ _null_ _null_ _null_ _null_ pg_ls_dir_extended _null_ _null_ _null_ ));
+DESCR("list all files in a directory");
 DATA(insert OID = 2626 ( pg_sleep			PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "701" _null_ _null_ _null_ _null_ _null_ pg_sleep _null_ _null_ _null_ ));
 DESCR("sleep for the specified time in seconds");
 DATA(insert OID = 3935 ( pg_sleep_for			PGNSP PGUID 14 1 0 0 0 f f f f t f v 1 0 2278 "1186" _null_ _null_ _null_ _null_ _null_ "select pg_catalog.pg_sleep(extract(epoch from pg_catalog.clock_timestamp() operator(pg_catalog.+) $1) operator(pg_catalog.-) extract(epoch from pg_catalog.clock_timestamp()))" _null_ _null_ _null_ ));
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 82d0e18..446cda6 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -484,6 +484,7 @@ extern Datum pg_read_binary_file_extended(PG_FUNCTION_ARGS);
 extern Datum pg_read_binary_file_all(PG_FUNCTION_ARGS);
 extern Datum pg_read_binary_file_all_extended(PG_FUNCTION_ARGS);
 extern Datum pg_ls_dir(PG_FUNCTION_ARGS);
+extern Datum pg_ls_dir_extended(PG_FUNCTION_ARGS);
 
 /* misc.c */
 extern Datum current_database(PG_FUNCTION_ARGS);
-- 
2.4.4

0005-Add-regression-tests-for-pg_ls_dir-and-pg_read_-bina.patchapplication/x-patch; name=0005-Add-regression-tests-for-pg_ls_dir-and-pg_read_-bina.patchDownload
From b15d22e38f0a27e0fedf3fb07e26dcb580f42cb9 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Tue, 23 Jun 2015 10:42:48 +0900
Subject: [PATCH 5/6] Add regression tests for pg_ls_dir and
 pg_read_[binary_]file

With the new options if_not_exists that have been added, this looks
worth adding.
---
 src/test/regress/expected/file.out | 84 ++++++++++++++++++++++++++++++++++++++
 src/test/regress/parallel_schedule |  3 ++
 src/test/regress/sql/file.sql      | 29 +++++++++++++
 3 files changed, 116 insertions(+)
 create mode 100644 src/test/regress/expected/file.out
 create mode 100644 src/test/regress/sql/file.sql

diff --git a/src/test/regress/expected/file.out b/src/test/regress/expected/file.out
new file mode 100644
index 0000000..0ddae30
--- /dev/null
+++ b/src/test/regress/expected/file.out
@@ -0,0 +1,84 @@
+--
+-- FILE
+--
+-- Tests of SQL functions interacting directly with files of PGDATA
+-- pg_read_file
+SELECT pg_read_file('postgresql.auto.conf', 0, 23); -- OK
+      pg_read_file       
+-------------------------
+ # Do not edit this file
+(1 row)
+
+SELECT pg_read_file('postgresql_not_exists', 0, 23); -- not OK
+ERROR:  could not open file "postgresql_not_exists" for reading: No such file or directory
+SELECT pg_read_file('postgresql_not_exists', 0, 23, false); -- not OK
+ERROR:  could not open file "postgresql_not_exists" for reading: No such file or directory
+SELECT pg_read_file('postgresql_not_exists', 0, 23, true); -- OK
+ pg_read_file 
+--------------
+ 
+(1 row)
+
+-- pg_read_binary_file
+SELECT pg_read_binary_file('postgresql.auto.conf', 0, 23); -- OK
+               pg_read_binary_file                
+--------------------------------------------------
+ \x2320446f206e6f74206564697420746869732066696c65
+(1 row)
+
+SELECT pg_read_binary_file('postgresql_not_exists', 0, 23); -- not OK
+ERROR:  could not open file "postgresql_not_exists" for reading: No such file or directory
+SELECT pg_read_binary_file('postgresql_not_exists', 0, 23, false); -- not OK
+ERROR:  could not open file "postgresql_not_exists" for reading: No such file or directory
+SELECT pg_read_binary_file('postgresql_not_exists', 0, 23, true); -- OK
+ pg_read_binary_file 
+---------------------
+ 
+(1 row)
+
+-- pg_ls_dir
+SELECT pg_ls_dir('postgresql.conf'); -- not OK, not a directory
+ERROR:  could not open directory "postgresql.conf": Not a directory
+SELECT pg_ls_dir('pg_twophase'); -- empty directory
+ pg_ls_dir 
+-----------
+(0 rows)
+
+SELECT pg_ls_dir('pg_twophase', true, true); -- OK, not empty list
+ pg_ls_dir 
+-----------
+ .
+ ..
+(2 rows)
+
+SELECT pg_ls_dir('pg_twophase', false, false); -- OK, empty list
+ pg_ls_dir 
+-----------
+(0 rows)
+
+SELECT pg_ls_dir('pg_twophase', false, true); -- OK, not empty list
+ pg_ls_dir 
+-----------
+ .
+ ..
+(2 rows)
+
+SELECT pg_ls_dir('pg_twophase', true, false); -- OK, empty list
+ pg_ls_dir 
+-----------
+(0 rows)
+
+SELECT pg_ls_dir('postgresql_not_exists', false, true); -- not OK
+ERROR:  could not open directory "postgresql_not_exists": No such file or directory
+SELECT pg_ls_dir('postgresql_not_exists', false, false); -- not OK
+ERROR:  could not open directory "postgresql_not_exists": No such file or directory
+SELECT pg_ls_dir('postgresql_not_exists', true, true); -- OK, empty list
+ pg_ls_dir 
+-----------
+(0 rows)
+
+SELECT pg_ls_dir('postgresql_not_exists', true, false); -- OK, empty list
+ pg_ls_dir 
+-----------
+(0 rows)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 91780cd..b1739fa 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
 # event triggers cannot run concurrently with any test that runs DDL
 test: event_trigger
 
+# Files read during this test may be modified in parallel by other tests
+test: file
+
 # run stats by itself because its delay may be insufficient under heavy load
 test: stats
diff --git a/src/test/regress/sql/file.sql b/src/test/regress/sql/file.sql
new file mode 100644
index 0000000..a0238f2
--- /dev/null
+++ b/src/test/regress/sql/file.sql
@@ -0,0 +1,29 @@
+--
+-- FILE
+--
+
+-- Tests of SQL functions interacting directly with files of PGDATA
+
+-- pg_read_file
+SELECT pg_read_file('postgresql.auto.conf', 0, 23); -- OK
+SELECT pg_read_file('postgresql_not_exists', 0, 23); -- not OK
+SELECT pg_read_file('postgresql_not_exists', 0, 23, false); -- not OK
+SELECT pg_read_file('postgresql_not_exists', 0, 23, true); -- OK
+
+-- pg_read_binary_file
+SELECT pg_read_binary_file('postgresql.auto.conf', 0, 23); -- OK
+SELECT pg_read_binary_file('postgresql_not_exists', 0, 23); -- not OK
+SELECT pg_read_binary_file('postgresql_not_exists', 0, 23, false); -- not OK
+SELECT pg_read_binary_file('postgresql_not_exists', 0, 23, true); -- OK
+
+-- pg_ls_dir
+SELECT pg_ls_dir('postgresql.conf'); -- not OK, not a directory
+SELECT pg_ls_dir('pg_twophase'); -- empty directory
+SELECT pg_ls_dir('pg_twophase', true, true); -- OK, not empty list
+SELECT pg_ls_dir('pg_twophase', false, false); -- OK, empty list
+SELECT pg_ls_dir('pg_twophase', false, true); -- OK, not empty list
+SELECT pg_ls_dir('pg_twophase', true, false); -- OK, empty list
+SELECT pg_ls_dir('postgresql_not_exists', false, true); -- not OK
+SELECT pg_ls_dir('postgresql_not_exists', false, false); -- not OK
+SELECT pg_ls_dir('postgresql_not_exists', true, true); -- OK, empty list
+SELECT pg_ls_dir('postgresql_not_exists', true, false); -- OK, empty list
-- 
2.4.4

20150623_pg_rewind_sleep.patchapplication/x-patch; name=20150623_pg_rewind_sleep.patchDownload
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 7e54ac5..ea5385c 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -291,6 +291,12 @@ main(int argc, char **argv)
 		print_filemap();
 
 	/*
+	 * Sleep here to allow user to do nasty things with the source file
+	 * list.
+	 */
+	pg_usleep(10000000); /* 10s */
+
+	/*
 	 * Ok, we're ready to start copying things over.
 	 */
 	if (showprogress)
rewind_test.bashapplication/octet-stream; name=rewind_test.bashDownload
#14Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#13)
Re: pg_rewind failure by file deletion in source server

On 06/23/2015 07:51 AM, Michael Paquier wrote:

So... Attached are a set of patches dedicated at fixing this issue:

Thanks for working on this!

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With thes patches, pg_read_file() will return NULL for any failure to
open the file, which makes pg_rewind to assume that the file doesn't
exist in the source server, and will remove the file from the
destination. That's dangerous, those functions should check specifically
for ENOENT.

There's still a small race condition with tablespaces. If you run CREATE
TABLESPACE in the source server while pg_rewind is running, it's
possible that the recursive query that pg_rewind uses sees the symlink
in pg_tblspc/ directory, but its snapshot doesn't see the row in
pg_tablespace yet. It will think that the symlink is a regular file, try
to read it, and fail (if we checked for ENOENT).

Actually, I think we need try to deal with symlinks a bit harder.
Currently, pg_rewind assumes that anything in pg_tblspace that has a
matching row in pg_tablespace is a symlink, and nothing else is. I think
symlinks to directories. I just noticed that pg_rewind fails miserable
if pg_xlog is a symlink, because of that:

----
The servers diverged at WAL position 0/3023F08 on timeline 1.
Rewinding from last common checkpoint at 0/2000060 on timeline 1

"data-master//pg_xlog" is not a directory
Failure, exiting
----

I think we need to add a column to pg_stat_file output, to indicate
symbolic links, and add a pg_readlink() function. That still leaves a
race condition if the type of a file changes, i.e. a file is deleted and
a directory with the same name is created in its place, but that seems
acceptable. I don't think PostgreSQL ever does such a thing, so that
could only happen if you mess with the data directory manually while the
server is running.

I just realized another problem: We recently learned the hard way that
some people have files in the data directory that are not writeable by
the 'postgres' user
(/messages/by-id/20150523172627.GA24277@msg.df7cb.de).
pg_rewind will try to overwrite all files it doesn't recognize as
relation files, so it's going to fail on those. A straightforward fix
would be to first open the destination file in read-only mode, and
compare its contents, and only open the file in write mode if it has
changed. It would still fail when the files really differ, but I think
that's acceptable.

I note that pg_rewind doesn't need to distinguish between an empty and a
non-existent directory, so it's quite silly for it to pass
include_dot_dirs=true, and then filter out "." and ".." from the result
set. The documentation should mention the main reason for including "."
and "..": to distinguish between an empty and non-existent directory.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#14)
Re: pg_rewind failure by file deletion in source server

On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/23/2015 07:51 AM, Michael Paquier wrote:

So... Attached are a set of patches dedicated at fixing this issue:

Thanks for working on this!

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With thes patches, pg_read_file() will return NULL for any failure to open
the file, which makes pg_rewind to assume that the file doesn't exist in the
source server, and will remove the file from the destination. That's
dangerous, those functions should check specifically for ENOENT.

I'm wondering if using pg_read_file() to copy the file from source server
is reasonable. ISTM that it has two problems as follows.

1. It cannot read very large file like 1GB file. So if such large file was
created in source server after failover, pg_rewind would not be able
to copy the file. No?

2. Many users may not allow a remote client to connect to the
PostgreSQL server as a superuser for some security reasons. IOW,
there would be no entry in pg_hba.conf for such connection.
In this case, pg_rewind always fails because pg_read_file() needs
superuser privilege. No?

I'm tempting to implement the replication command version of
pg_read_file(). That is, it reads and sends the data like BASE_BACKUP
replication command does...

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Fujii Masao (#15)
Re: pg_rewind failure by file deletion in source server

On 06/23/2015 05:03 PM, Fujii Masao wrote:

On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/23/2015 07:51 AM, Michael Paquier wrote:

So... Attached are a set of patches dedicated at fixing this issue:

Thanks for working on this!

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With thes patches, pg_read_file() will return NULL for any failure to open
the file, which makes pg_rewind to assume that the file doesn't exist in the
source server, and will remove the file from the destination. That's
dangerous, those functions should check specifically for ENOENT.

I'm wondering if using pg_read_file() to copy the file from source server
is reasonable. ISTM that it has two problems as follows.

1. It cannot read very large file like 1GB file. So if such large file was
created in source server after failover, pg_rewind would not be able
to copy the file. No?

pg_read_binary_file() handles large files just fine. It cannot return
more than 1GB in one call, but you can call it several times and
retrieve the file in chunks. That's what pg_rewind does, except for
reading the control file, which is known to be small.

2. Many users may not allow a remote client to connect to the
PostgreSQL server as a superuser for some security reasons. IOW,
there would be no entry in pg_hba.conf for such connection.
In this case, pg_rewind always fails because pg_read_file() needs
superuser privilege. No?

I'm tempting to implement the replication command version of
pg_read_file(). That is, it reads and sends the data like BASE_BACKUP
replication command does...

Yeah, that would definitely be nice. Peter suggested it back in January
(/messages/by-id/54AC4801.7050300@gmx.net). I think
it's way too late to do that for 9.5, however. I'm particularly worried
that if we design the required API in a rush, we're not going to get it
right, and will have to change it again soon. That might be difficult in
a minor release. Using pg_read_file() and friends is quite flexible,
even though we just find out that they're not quite flexible enough
right now (the ENOENT problem).

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#16)
Re: pg_rewind failure by file deletion in source server

On Tue, Jun 23, 2015 at 11:21 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/23/2015 05:03 PM, Fujii Masao wrote:

On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

On 06/23/2015 07:51 AM, Michael Paquier wrote:

So... Attached are a set of patches dedicated at fixing this issue:

Thanks for working on this!

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With thes patches, pg_read_file() will return NULL for any failure to
open
the file, which makes pg_rewind to assume that the file doesn't exist in
the
source server, and will remove the file from the destination. That's
dangerous, those functions should check specifically for ENOENT.

I'm wondering if using pg_read_file() to copy the file from source server
is reasonable. ISTM that it has two problems as follows.

1. It cannot read very large file like 1GB file. So if such large file was
created in source server after failover, pg_rewind would not be able
to copy the file. No?

pg_read_binary_file() handles large files just fine. It cannot return more
than 1GB in one call, but you can call it several times and retrieve the
file in chunks. That's what pg_rewind does, except for reading the control
file, which is known to be small.

Yeah, you're right.

I found that pg_rewind creates a temporary table to fetch the file in chunks.
This would prevent pg_rewind from using the *hot standby* server as a source
server at all. This is of course a limitation of pg_rewind, but we might want
to alleviate it in the future.

2. Many users may not allow a remote client to connect to the
PostgreSQL server as a superuser for some security reasons. IOW,
there would be no entry in pg_hba.conf for such connection.
In this case, pg_rewind always fails because pg_read_file() needs
superuser privilege. No?

I'm tempting to implement the replication command version of
pg_read_file(). That is, it reads and sends the data like BASE_BACKUP
replication command does...

Yeah, that would definitely be nice. Peter suggested it back in January
(/messages/by-id/54AC4801.7050300@gmx.net). I think
it's way too late to do that for 9.5, however. I'm particularly worried that
if we design the required API in a rush, we're not going to get it right,
and will have to change it again soon. That might be difficult in a minor
release. Using pg_read_file() and friends is quite flexible, even though we
just find out that they're not quite flexible enough right now (the ENOENT
problem).

I agree that it's too late to do what I said...

But just using pg_read_file() cannot address the #2 problem that I pointed
in my previous email. Also requiring a superuer privilege on pg_rewind
really conflicts with the motivation why we added replication privilege.

So we should change pg_read_file() so that even replication user can
read the file?
Or replication user version of pg_read_file() should be implemented?

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#17)
Re: pg_rewind failure by file deletion in source server

On Wed, Jun 24, 2015 at 1:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Jun 23, 2015 at 11:21 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/23/2015 05:03 PM, Fujii Masao wrote:

On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

On 06/23/2015 07:51 AM, Michael Paquier wrote:

So... Attached are a set of patches dedicated at fixing this issue:

Thanks for working on this!

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With thes patches, pg_read_file() will return NULL for any failure to
open
the file, which makes pg_rewind to assume that the file doesn't exist in
the
source server, and will remove the file from the destination. That's
dangerous, those functions should check specifically for ENOENT.

I'm wondering if using pg_read_file() to copy the file from source server
is reasonable. ISTM that it has two problems as follows.

1. It cannot read very large file like 1GB file. So if such large file was
created in source server after failover, pg_rewind would not be able
to copy the file. No?

pg_read_binary_file() handles large files just fine. It cannot return more
than 1GB in one call, but you can call it several times and retrieve the
file in chunks. That's what pg_rewind does, except for reading the control
file, which is known to be small.

Yeah, you're right.

I found that pg_rewind creates a temporary table to fetch the file in chunks.
This would prevent pg_rewind from using the *hot standby* server as a source
server at all. This is of course a limitation of pg_rewind, but we might want
to alleviate it in the future.

This is something that a replication command could address properly.

2. Many users may not allow a remote client to connect to the
PostgreSQL server as a superuser for some security reasons. IOW,
there would be no entry in pg_hba.conf for such connection.
In this case, pg_rewind always fails because pg_read_file() needs
superuser privilege. No?

I'm tempting to implement the replication command version of
pg_read_file(). That is, it reads and sends the data like BASE_BACKUP
replication command does...

Yeah, that would definitely be nice. Peter suggested it back in January
(/messages/by-id/54AC4801.7050300@gmx.net). I think
it's way too late to do that for 9.5, however. I'm particularly worried that
if we design the required API in a rush, we're not going to get it right,
and will have to change it again soon. That might be difficult in a minor
release. Using pg_read_file() and friends is quite flexible, even though we
just find out that they're not quite flexible enough right now (the ENOENT
problem).

I agree that it's too late to do what I said...

But just using pg_read_file() cannot address the #2 problem that I pointed
in my previous email. Also requiring a superuser privilege on pg_rewind
really conflicts with the motivation why we added replication privilege.

So we should change pg_read_file() so that even replication user can
read the file?

From the security prospective, a replication user can take a base
backup so it can already retrieve easily the contents of PGDATA. Hence
I guess that it would be fine. However, what about cases where
pg_hba.conf authorizes access to a given replication user via psql and
blocks it for the replication protocol? We could say that OP should
not give out replication access that easily, but in this case the user
would have access to the content of PGDATA even if he should not. Is
that unrealistic?

Or replication user version of pg_read_file() should be implemented?

You mean a new function? In what is it different from authorizing
pg_read_file usage for a replication user?

Honestly, I can live with this superuser restriction in 9.5. And come
back to the replication user restriction in 9.6 once things cool down
a bit.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#18)
Re: pg_rewind failure by file deletion in source server

On Wed, Jun 24, 2015 at 3:36 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Jun 24, 2015 at 1:40 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Jun 23, 2015 at 11:21 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/23/2015 05:03 PM, Fujii Masao wrote:

On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

On 06/23/2015 07:51 AM, Michael Paquier wrote:

So... Attached are a set of patches dedicated at fixing this issue:

Thanks for working on this!

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With thes patches, pg_read_file() will return NULL for any failure to
open
the file, which makes pg_rewind to assume that the file doesn't exist in
the
source server, and will remove the file from the destination. That's
dangerous, those functions should check specifically for ENOENT.

I'm wondering if using pg_read_file() to copy the file from source server
is reasonable. ISTM that it has two problems as follows.

1. It cannot read very large file like 1GB file. So if such large file was
created in source server after failover, pg_rewind would not be able
to copy the file. No?

pg_read_binary_file() handles large files just fine. It cannot return more
than 1GB in one call, but you can call it several times and retrieve the
file in chunks. That's what pg_rewind does, except for reading the control
file, which is known to be small.

Yeah, you're right.

I found that pg_rewind creates a temporary table to fetch the file in chunks.
This would prevent pg_rewind from using the *hot standby* server as a source
server at all. This is of course a limitation of pg_rewind, but we might want
to alleviate it in the future.

This is something that a replication command could address properly.

2. Many users may not allow a remote client to connect to the
PostgreSQL server as a superuser for some security reasons. IOW,
there would be no entry in pg_hba.conf for such connection.
In this case, pg_rewind always fails because pg_read_file() needs
superuser privilege. No?

I'm tempting to implement the replication command version of
pg_read_file(). That is, it reads and sends the data like BASE_BACKUP
replication command does...

Yeah, that would definitely be nice. Peter suggested it back in January
(/messages/by-id/54AC4801.7050300@gmx.net). I think
it's way too late to do that for 9.5, however. I'm particularly worried that
if we design the required API in a rush, we're not going to get it right,
and will have to change it again soon. That might be difficult in a minor
release. Using pg_read_file() and friends is quite flexible, even though we
just find out that they're not quite flexible enough right now (the ENOENT
problem).

I agree that it's too late to do what I said...

But just using pg_read_file() cannot address the #2 problem that I pointed
in my previous email. Also requiring a superuser privilege on pg_rewind
really conflicts with the motivation why we added replication privilege.

So we should change pg_read_file() so that even replication user can
read the file?

From the security prospective, a replication user can take a base
backup so it can already retrieve easily the contents of PGDATA. Hence
I guess that it would be fine. However, what about cases where
pg_hba.conf authorizes access to a given replication user via psql and
blocks it for the replication protocol? We could say that OP should
not give out replication access that easily, but in this case the user
would have access to the content of PGDATA even if he should not. Is
that unrealistic?

The most realistic case is that both source and target servers have
the pg_hba.conf containing the following authentication setting
regarding replication. That is, each server allows other to use the
replication user to connect to via replication protocol.

# TYPE DATABASE USER ADDRESS METHOD
host replication repuser X.X.X.X/Y md5

This case makes me think that allowing even replication user to
call pg_read_file() may not be good solution for us. Because
in that case replication user needs to log in the real database
like "postgres" to call pg_read_file(), but usually there would be
no entry allowing replication user to connect to any real database in
pg_hba.conf. So if we want to address this problem, replication
command version of pg_read_file() would be required. However,
that's too late to do for now...

Or replication user version of pg_read_file() should be implemented?

You mean a new function? In what is it different from authorizing
pg_read_file usage for a replication user?

Honestly, I can live with this superuser restriction in 9.5. And come
back to the replication user restriction in 9.6 once things cool down
a bit.

Yeah, finally I agree with you. This seems only approach we can adopt
for now.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#14)
8 attachment(s)
Re: pg_rewind failure by file deletion in source server

On Tue, Jun 23, 2015 at 9:19 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/23/2015 07:51 AM, Michael Paquier wrote:

- 0001, add if_not_exists to pg_tablespace_location, returning NULL if
path does not exist
- 0002, same with pg_stat_file, returning NULL if file does not exist
- 0003, same with pg_read_*file. I added them to all the existing
functions for consistency.
- 0004, pg_ls_dir extended with if_not_exists and include_dot_dirs
(thanks Robert for the naming!)
- 0005, as things get complex, a set of regression tests aimed to
covering those things. pg_tablespace_location is platform-dependent,
so there are no tests for it.
- 0006, the fix for pg_rewind, using what has been implemented before.

With these patches, pg_read_file() will return NULL for any failure to open
the file, which makes pg_rewind to assume that the file doesn't exist in the
source server, and will remove the file from the destination. That's
dangerous, those functions should check specifically for ENOENT.

This makes sense. I changed all those functions to do so.

There's still a small race condition with tablespaces. If you run CREATE
TABLESPACE in the source server while pg_rewind is running, it's possible
that the recursive query that pg_rewind uses sees the symlink in pg_tblspc/
directory, but its snapshot doesn't see the row in pg_tablespace yet. It
will think that the symlink is a regular file, try to read it, and fail (if
we checked for ENOENT).
Actually, I think we need try to deal with symlinks a bit harder. Currently,
pg_rewind assumes that anything in pg_tblspace that has a matching row in
pg_tablespace is a symlink, and nothing else is. I think symlinks to
directories. I just noticed that pg_rewind fails miserable if pg_xlog is a
symlink, because of that:

----
The servers diverged at WAL position 0/3023F08 on timeline 1.
Rewinding from last common checkpoint at 0/2000060 on timeline 1

"data-master//pg_xlog" is not a directory
Failure, exiting
----

It may be possible that in this case the path is a symlink on the
source but not on the target, and vice-versa, so this looks too
restrictive to me if we begin to use pg_readlink. Think for example of
a symlink of pg_xlog that is not included in a base backup, we ignore
it in copy_fetch.c for pg_xlog and the others, I think that here as
well we should ignore those errors except for tablespaces.

I think we need to add a column to pg_stat_file output, to indicate symbolic
links, and add a pg_readlink() function. That still leaves a race condition
if the type of a file changes, i.e. a file is deleted and a directory with
the same name is created in its place, but that seems acceptable. I don't
think PostgreSQL ever does such a thing, so that could only happen if you
mess with the data directory manually while the server is running.

Hm. pg_stat_file uses now stat(), and not lstat() so it cannot make
the difference between what is a link or not. I have changed
pg_stat_file to use lstat instead to cover this case in my set of
patches, but a new function may be a better answer here. I have added
as well a pg_readlink() function on the stack, and actually the
if_not_exists mode of pg_tablespace_location is not needed anymore if
we rely on pg_readlink in this case. I have let it in the set of
patches though. This still looks useful for other utility tools.

I just realized another problem: We recently learned the hard way that some
people have files in the data directory that are not writeable by the
'postgres' user
(/messages/by-id/20150523172627.GA24277@msg.df7cb.de).
pg_rewind will try to overwrite all files it doesn't recognize as relation
files, so it's going to fail on those. A straightforward fix would be to
first open the destination file in read-only mode, and compare its contents,
and only open the file in write mode if it has changed. It would still fail
when the files really differ, but I think that's acceptable.

If I am missing nothing, two code paths need to be patched here:
copy_file_range and receiveFileChunks. copy_file_range is
straight-forward. Now wouldn't it be better to write the contents into
a temporary file, compare their content, and then switch if necessary
for receiveFileChunks?

I note that pg_rewind doesn't need to distinguish between an empty and a
non-existent directory, so it's quite silly for it to pass
include_dot_dirs=true, and then filter out "." and ".." from the result set.
The documentation should mention the main reason for including "." and "..":
to distinguish between an empty and non-existent directory.

OK. Switched to that in the first patch for pg_rewind.

Attached is a new set of patches. Except for the last ones that
addresses one issue of pg_rewind (symlink management when streaming
PGDATA), all the others introduce if_not_exists options for the
functions of genfile.c. The pg_rewind stuff could be more polished
though. Feel free to comment.
Regards,
--
Michael

Attachments:

0001-Extend-pg_tablespace_location-with-if_not_exists-opt.patchapplication/x-patch; name=0001-Extend-pg_tablespace_location-with-if_not_exists-opt.patchDownload
From 228d17d6808d350507a2ef137c7bdb496dc765d3 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Wed, 24 Jun 2015 10:32:01 +0900
Subject: [PATCH 1/8] Extend pg_tablespace_location with if_not_exists option

This is useful for code paths that prefer receive an empty response
instead of an ERROR if tablespace specified does not exist.
---
 doc/src/sgml/func.sgml        |  8 +++++--
 src/backend/utils/adt/misc.c  | 49 ++++++++++++++++++++++++++++++++++++-------
 src/include/catalog/pg_proc.h |  2 ++
 src/include/utils/builtins.h  |  1 +
 4 files changed, 50 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 650051b..7929e7e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15844,9 +15844,13 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);
        <entry>get the set of database OIDs that have objects in the tablespace</entry>
       </row>
       <row>
-       <entry><literal><function>pg_tablespace_location(<parameter>tablespace_oid</parameter>)</function></literal></entry>
+       <entry><literal><function>pg_tablespace_location(<parameter>tablespace_oid</parameter> [, <parameter>if_not_exists</> <type>boolean</type>])</function></literal></entry>
        <entry><type>text</type></entry>
-       <entry>get the path in the file system that this tablespace is located in</entry>
+       <entry>
+        Get the path in the file system that this tablespace is located in.
+        If <parameter>if_not_exists</parameter> is true, <literal>NULL</literal>
+        is returned if tablespace does not exist instead of an error.
+       </entry>
       </row>
       <row>
        <entry><literal><function>pg_typeof(<parameter>any</parameter>)</function></literal></entry>
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index c0495d9..dc19c2c 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -336,17 +336,24 @@ pg_tablespace_databases(PG_FUNCTION_ARGS)
 	SRF_RETURN_DONE(funcctx);
 }
 
-
 /*
- * pg_tablespace_location - get location for a tablespace
+ * Wrapper function for pg_tablespace_location and
+ * pg_tablespace_location_extended.
  */
-Datum
-pg_tablespace_location(PG_FUNCTION_ARGS)
+static Datum
+tablespace_location_wrapper(FunctionCallInfo fcinfo)
 {
 	Oid			tablespaceOid = PG_GETARG_OID(0);
 	char		sourcepath[MAXPGPATH];
 	char		targetpath[MAXPGPATH];
 	int			rllen;
+	bool		if_not_exists = false;
+
+	/*
+	 * Check for IF NOT EXISTS option.
+	 */
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+		if_not_exists = PG_GETARG_BOOL(1);
 
 	/*
 	 * It's useful to apply this function to pg_class.reltablespace, wherein
@@ -373,15 +380,22 @@ pg_tablespace_location(PG_FUNCTION_ARGS)
 
 	rllen = readlink(sourcepath, targetpath, sizeof(targetpath));
 	if (rllen < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read symbolic link \"%s\": %m",
-						sourcepath)));
+	{
+		if (if_not_exists && errno == ENOENT)
+			PG_RETURN_NULL();
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read symbolic link \"%s\": %m",
+							sourcepath)));
+	}
 	if (rllen >= sizeof(targetpath))
+	{
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 				 errmsg("symbolic link \"%s\" target is too long",
 						sourcepath)));
+	}
 	targetpath[rllen] = '\0';
 
 	PG_RETURN_TEXT_P(cstring_to_text(targetpath));
@@ -394,6 +408,25 @@ pg_tablespace_location(PG_FUNCTION_ARGS)
 }
 
 /*
+ * pg_tablespace_location - get location for a tablespace
+ */
+Datum
+pg_tablespace_location(PG_FUNCTION_ARGS)
+{
+	return tablespace_location_wrapper(fcinfo);
+}
+
+/*
+ * pg_tablespace_location - get location for a tablespace, with option to
+ * bypass errors in case of a non-existing tablespace.
+ */
+Datum
+pg_tablespace_location_extended(PG_FUNCTION_ARGS)
+{
+	return tablespace_location_wrapper(fcinfo);
+}
+
+/*
  * pg_sleep - delay for N seconds
  */
 Datum
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 6b3d194..1f163f0 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2922,6 +2922,8 @@ DESCR("current trigger depth");
 
 DATA(insert OID = 3778 ( pg_tablespace_location PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 25 "26" _null_ _null_ _null_ _null_ _null_ pg_tablespace_location _null_ _null_ _null_ ));
 DESCR("tablespace location");
+DATA(insert OID = 3579 ( pg_tablespace_location PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 25 "26 16" _null_ _null_ _null_ _null_ _null_ pg_tablespace_location_extended _null_ _null_ _null_ ));
+DESCR("tablespace location");
 
 DATA(insert OID = 1946 (  encode						PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 25 "17 25" _null_ _null_ _null_ _null_ _null_ binary_encode _null_ _null_ _null_ ));
 DESCR("convert bytea value into some ascii-only text string");
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 51f25a2..ab78ec2 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -486,6 +486,7 @@ extern Datum pg_terminate_backend(PG_FUNCTION_ARGS);
 extern Datum pg_reload_conf(PG_FUNCTION_ARGS);
 extern Datum pg_tablespace_databases(PG_FUNCTION_ARGS);
 extern Datum pg_tablespace_location(PG_FUNCTION_ARGS);
+extern Datum pg_tablespace_location_extended(PG_FUNCTION_ARGS);
 extern Datum pg_rotate_logfile(PG_FUNCTION_ARGS);
 extern Datum pg_sleep(PG_FUNCTION_ARGS);
 extern Datum pg_get_keywords(PG_FUNCTION_ARGS);
-- 
2.4.4

0002-Extend-pg_stat_file-with-if_not_exists-option.patchapplication/x-patch; name=0002-Extend-pg_stat_file-with-if_not_exists-option.patchDownload
From 2804fdc5fd974e688a930ab58559dbb7047cca8e Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Wed, 24 Jun 2015 10:34:17 +0900
Subject: [PATCH 2/8] Extend pg_stat_file with if_not_exists option

This is useful for code paths that prefer receiving a NULL result
instead of an ERROR if file specified does not exist.
---
 doc/src/sgml/func.sgml          |  9 ++++--
 src/backend/utils/adt/genfile.c | 65 +++++++++++++++++++++++++++++++----------
 src/include/catalog/pg_proc.h   |  2 ++
 src/include/utils/builtins.h    |  1 +
 4 files changed, 59 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7929e7e..0f70985 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17830,10 +17830,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
       </row>
       <row>
        <entry>
-        <literal><function>pg_stat_file(<parameter>filename</> <type>text</>)</function></literal>
+        <literal><function>pg_stat_file(<parameter>filename</> <type>text</>[, <parameter>if_not_exists</> <type>boolean</type>])</function></literal>
        </entry>
        <entry><type>record</type></entry>
-       <entry>Return information about a file</entry>
+       <entry>
+        Return information about a file. If
+        <parameter>if_not_exists</parameter> is true, <literal>NULL</literal>
+        is returned for all result fields if file does not exist instead of
+        an error.
+       </entry>
       </row>
      </tbody>
     </tgroup>
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index f3f3cca..dd97d83 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -254,10 +254,10 @@ pg_read_binary_file_all(PG_FUNCTION_ARGS)
 }
 
 /*
- * stat a file
+ * Wrapper for pg_stat_file and pg_stat_file_extended.
  */
-Datum
-pg_stat_file(PG_FUNCTION_ARGS)
+static Datum
+stat_file_wrapper(PG_FUNCTION_ARGS)
 {
 	text	   *filename_t = PG_GETARG_TEXT_P(0);
 	char	   *filename;
@@ -266,18 +266,29 @@ pg_stat_file(PG_FUNCTION_ARGS)
 	bool		isnull[6];
 	HeapTuple	tuple;
 	TupleDesc	tupdesc;
+	bool		if_not_exists = false;
+	bool		return_null = false;
 
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 (errmsg("must be superuser to get file information"))));
 
+	/* Check for option if_not_exists */
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+		if_not_exists = PG_GETARG_BOOL(1);
+
 	filename = convert_and_check_filename(filename_t);
 
 	if (stat(filename, &fst) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not stat file \"%s\": %m", filename)));
+	{
+		if (if_not_exists && errno == ENOENT)
+			return_null = true;
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not stat file \"%s\": %m", filename)));
+	}
 
 	/*
 	 * This record type had better match the output parameters declared for me
@@ -298,20 +309,23 @@ pg_stat_file(PG_FUNCTION_ARGS)
 					   "isdir", BOOLOID, -1, 0);
 	BlessTupleDesc(tupdesc);
 
-	memset(isnull, false, sizeof(isnull));
+	memset(isnull, return_null, sizeof(isnull));
 
-	values[0] = Int64GetDatum((int64) fst.st_size);
-	values[1] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_atime));
-	values[2] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_mtime));
-	/* Unix has file status change time, while Win32 has creation time */
+	if (!return_null)
+	{
+		values[0] = Int64GetDatum((int64) fst.st_size);
+		values[1] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_atime));
+		values[2] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_mtime));
+		/* Unix has file status change time, while Win32 has creation time */
 #if !defined(WIN32) && !defined(__CYGWIN__)
-	values[3] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_ctime));
-	isnull[4] = true;
+		values[3] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_ctime));
+		isnull[4] = true;
 #else
-	isnull[3] = true;
-	values[4] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_ctime));
+		isnull[3] = true;
+		values[4] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_ctime));
 #endif
-	values[5] = BoolGetDatum(S_ISDIR(fst.st_mode));
+		values[5] = BoolGetDatum(S_ISDIR(fst.st_mode));
+	}
 
 	tuple = heap_form_tuple(tupdesc, values, isnull);
 
@@ -320,6 +334,25 @@ pg_stat_file(PG_FUNCTION_ARGS)
 	PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
 }
 
+/*
+ * pg_stat_file_extended
+ * Stat a file, with possibility to bypass error in case of a missing file.
+ */
+Datum
+pg_stat_file_extended(PG_FUNCTION_ARGS)
+{
+	return stat_file_wrapper(fcinfo);
+}
+
+/*
+ * stat a file
+ */
+Datum
+pg_stat_file(PG_FUNCTION_ARGS)
+{
+	return stat_file_wrapper(fcinfo);
+}
+
 
 /*
  * List a directory (returns the filenames only)
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 1f163f0..8da1781 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3185,6 +3185,8 @@ DESCR("rotate log file");
 
 DATA(insert OID = 2623 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2249 "25" "{25,20,1184,1184,1184,1184,16}" "{i,o,o,o,o,o,o}" "{filename,size,access,modification,change,creation,isdir}" _null_ _null_ pg_stat_file _null_ _null_ _null_ ));
 DESCR("get information about file");
+DATA(insert OID = 3307 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2249 "25 16" "{25,16,20,1184,1184,1184,1184,16}" "{i,i,o,o,o,o,o,o}" "{filename,if_not_exists,size,access,modification,change,creation,isdir}" _null_ _null_ pg_stat_file_extended _null_ _null_ _null_ ));
+DESCR("get information about file");
 DATA(insert OID = 2624 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 25 "25 20 20" _null_ _null_ _null_ _null_ _null_ pg_read_file _null_ _null_ _null_ ));
 DESCR("read text from a file");
 DATA(insert OID = 3826 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 25 "25" _null_ _null_ _null_ _null_ _null_ pg_read_file_all _null_ _null_ _null_ ));
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index ab78ec2..e46650a 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -472,6 +472,7 @@ extern Datum pg_relation_filepath(PG_FUNCTION_ARGS);
 extern bytea *read_binary_file(const char *filename,
 				 int64 seek_offset, int64 bytes_to_read);
 extern Datum pg_stat_file(PG_FUNCTION_ARGS);
+extern Datum pg_stat_file_extended(PG_FUNCTION_ARGS);
 extern Datum pg_read_file(PG_FUNCTION_ARGS);
 extern Datum pg_read_file_all(PG_FUNCTION_ARGS);
 extern Datum pg_read_binary_file(PG_FUNCTION_ARGS);
-- 
2.4.4

0003-Add-IF-NOT-EXISTS-to-pg_read_file-and-pg_read_binary.patchapplication/x-patch; name=0003-Add-IF-NOT-EXISTS-to-pg_read_file-and-pg_read_binary.patchDownload
From 0b6b6b438d6a17573f929e50c3adfd726f0b3948 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Wed, 24 Jun 2015 10:37:31 +0900
Subject: [PATCH 3/8] Add IF NOT EXISTS to pg_read_file and pg_read_binary_file

This is useful for code paths that do not want to fail if the file
queried does not exist.
---
 doc/src/sgml/func.sgml           |  16 +++-
 src/backend/commands/extension.c |   2 +-
 src/backend/utils/adt/genfile.c  | 174 +++++++++++++++++++++++++++++++++------
 src/include/catalog/pg_proc.h    |   8 ++
 src/include/utils/builtins.h     |   8 +-
 5 files changed, 179 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0f70985..edfea8b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17816,17 +17816,25 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
       </row>
       <row>
        <entry>
-        <literal><function>pg_read_file(<parameter>filename</> <type>text</> [, <parameter>offset</> <type>bigint</>, <parameter>length</> <type>bigint</>])</function></literal>
+        <literal><function>pg_read_file(<parameter>filename</> <type>text</> [, <parameter>offset</> <type>bigint</>, <parameter>length</> <type>bigint</> [, <parameter>if_not_exists</> <type>boolean</>] ])</function></literal>
        </entry>
        <entry><type>text</type></entry>
-       <entry>Return the contents of a text file</entry>
+       <entry>
+        Return the contents of a text file. If
+        <parameter>if_not_exists</parameter> is true, <literal>NULL</literal>
+        is returned if file does not exist instead of an error.
+       </entry>
       </row>
       <row>
        <entry>
-        <literal><function>pg_read_binary_file(<parameter>filename</> <type>text</> [, <parameter>offset</> <type>bigint</>, <parameter>length</> <type>bigint</>])</function></literal>
+        <literal><function>pg_read_binary_file(<parameter>filename</> <type>text</> [, <parameter>offset</> <type>bigint</>, <parameter>length</> <type>bigint</> [, <parameter>if_not_exists</> <type>boolean</>] ])</function></literal>
        </entry>
        <entry><type>bytea</type></entry>
-       <entry>Return the contents of a file</entry>
+       <entry>
+        Return the contents of a file. If
+        <parameter>if_not_exists</parameter> is true, <literal>NULL</literal>
+        is returned if file does not exist instead of an error.
+       </entry>
       </row>
       <row>
        <entry>
diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 5cc74d0..25bbe01 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -640,7 +640,7 @@ read_extension_script_file(const ExtensionControlFile *control,
 	char	   *dest_str;
 	int			len;
 
-	content = read_binary_file(filename, 0, -1);
+	content = read_binary_file(filename, 0, -1, false);
 
 	/* use database encoding if not given */
 	if (control->encoding < 0)
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index dd97d83..d45c66e 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -88,7 +88,8 @@ convert_and_check_filename(text *arg)
  * We read the whole of the file when bytes_to_read is negative.
  */
 bytea *
-read_binary_file(const char *filename, int64 seek_offset, int64 bytes_to_read)
+read_binary_file(const char *filename, int64 seek_offset,
+				 int64 bytes_to_read, bool if_not_exists)
 {
 	bytea	   *buf;
 	size_t		nbytes;
@@ -103,9 +104,14 @@ read_binary_file(const char *filename, int64 seek_offset, int64 bytes_to_read)
 			struct stat fst;
 
 			if (stat(filename, &fst) < 0)
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not stat file \"%s\": %m", filename)));
+			{
+				if (if_not_exists && errno == ENOENT)
+					return NULL;
+				else
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not stat file \"%s\": %m", filename)));
+			}
 
 			bytes_to_read = fst.st_size - seek_offset;
 		}
@@ -118,10 +124,15 @@ read_binary_file(const char *filename, int64 seek_offset, int64 bytes_to_read)
 				 errmsg("requested length too large")));
 
 	if ((file = AllocateFile(filename, PG_BINARY_R)) == NULL)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\" for reading: %m",
-						filename)));
+	{
+		if (if_not_exists && errno == ENOENT)
+			return NULL;
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\" for reading: %m",
+							filename)));
+	}
 
 	if (fseeko(file, (off_t) seek_offset,
 			   (seek_offset >= 0) ? SEEK_SET : SEEK_END) != 0)
@@ -150,11 +161,16 @@ read_binary_file(const char *filename, int64 seek_offset, int64 bytes_to_read)
  * in the database encoding.
  */
 static text *
-read_text_file(const char *filename, int64 seek_offset, int64 bytes_to_read)
+read_text_file(const char *filename, int64 seek_offset,
+			   int64 bytes_to_read, bool if_not_exists)
 {
 	bytea	   *buf;
 
-	buf = read_binary_file(filename, seek_offset, bytes_to_read);
+	buf = read_binary_file(filename, seek_offset,
+						   bytes_to_read, if_not_exists);
+
+	if (buf == NULL)
+		return NULL;
 
 	/* Make sure the input is valid */
 	pg_verifymbstr(VARDATA(buf), VARSIZE(buf) - VARHDRSZ, false);
@@ -164,21 +180,27 @@ read_text_file(const char *filename, int64 seek_offset, int64 bytes_to_read)
 }
 
 /*
- * Read a section of a file, returning it as text
+ * Wrapper function for pg_read_file and pg_read_file_extended.
  */
-Datum
-pg_read_file(PG_FUNCTION_ARGS)
+static Datum
+read_file_wrapper(PG_FUNCTION_ARGS)
 {
 	text	   *filename_t = PG_GETARG_TEXT_P(0);
 	int64		seek_offset = PG_GETARG_INT64(1);
 	int64		bytes_to_read = PG_GETARG_INT64(2);
+	bool		if_not_exists = false;
 	char	   *filename;
+	bytea	   *result;
 
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 (errmsg("must be superuser to read files"))));
 
+	/* Check for option if_not_exists */
+	if (PG_NARGS() == 4 && !PG_ARGISNULL(3))
+		if_not_exists = PG_GETARG_BOOL(3);
+
 	filename = convert_and_check_filename(filename_t);
 
 	if (bytes_to_read < 0)
@@ -186,44 +208,100 @@ pg_read_file(PG_FUNCTION_ARGS)
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("requested length cannot be negative")));
 
-	PG_RETURN_TEXT_P(read_text_file(filename, seek_offset, bytes_to_read));
+	result = read_text_file(filename, seek_offset, bytes_to_read,
+							if_not_exists);
+	if (result == NULL)
+		PG_RETURN_NULL();
+	PG_RETURN_TEXT_P(result);
 }
 
 /*
- * Read the whole of a file, returning it as text
+ * Read a section of a file, returning it as text
  */
 Datum
-pg_read_file_all(PG_FUNCTION_ARGS)
+pg_read_file(PG_FUNCTION_ARGS)
+{
+	return read_file_wrapper(fcinfo);
+}
+
+/*
+ * Similar to pg_read_file, with IF NOT EXISTS option.
+ */
+Datum
+pg_read_file_extended(PG_FUNCTION_ARGS)
+{
+	return read_file_wrapper(fcinfo);
+}
+
+/*
+ * Wrapper function for pg_read_file_all and pg_read_file_all_extended.
+ */
+static Datum
+read_file_all_wrapper(PG_FUNCTION_ARGS)
 {
 	text	   *filename_t = PG_GETARG_TEXT_P(0);
 	char	   *filename;
+	text	   *result;
+	bool		if_not_exists = false;
 
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 (errmsg("must be superuser to read files"))));
 
+	/* Check for option if_not_exists */
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+		if_not_exists = PG_GETARG_BOOL(1);
+
 	filename = convert_and_check_filename(filename_t);
 
-	PG_RETURN_TEXT_P(read_text_file(filename, 0, -1));
+	result = read_text_file(filename, 0, -1, if_not_exists);
+	if (result == NULL)
+		PG_RETURN_NULL();
+
+	PG_RETURN_TEXT_P(result);
 }
 
 /*
- * Read a section of a file, returning it as bytea
+ * Read the whole of a file, returning it as text
  */
 Datum
-pg_read_binary_file(PG_FUNCTION_ARGS)
+pg_read_file_all(PG_FUNCTION_ARGS)
+{
+	return read_file_all_wrapper(fcinfo);
+}
+
+/*
+ * Similar to pg_read_file_all, with IF NOT EXISTS option.
+ */
+Datum
+pg_read_file_all_extended(PG_FUNCTION_ARGS)
+{
+	return read_file_all_wrapper(fcinfo);
+}
+
+/*
+ * Wrapper for pg_read_binary_file and pg_read_binary_file_extended.
+ */
+static Datum
+read_binary_file_wrapper(PG_FUNCTION_ARGS)
 {
 	text	   *filename_t = PG_GETARG_TEXT_P(0);
 	int64		seek_offset = PG_GETARG_INT64(1);
 	int64		bytes_to_read = PG_GETARG_INT64(2);
 	char	   *filename;
+	bool		if_not_exists = false;
+	bytea	   *result;
 
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 (errmsg("must be superuser to read files"))));
 
+	/* Check for option if_not_exists */
+	if (PG_NARGS() == 4 && !PG_ARGISNULL(3))
+		if_not_exists = PG_GETARG_BOOL(3);
+
 	filename = convert_and_check_filename(filename_t);
 
 	if (bytes_to_read < 0)
@@ -231,26 +309,76 @@ pg_read_binary_file(PG_FUNCTION_ARGS)
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("requested length cannot be negative")));
 
-	PG_RETURN_BYTEA_P(read_binary_file(filename, seek_offset, bytes_to_read));
+	result = read_binary_file(filename, seek_offset,
+							  bytes_to_read, if_not_exists);
+	if (result == NULL)
+		PG_RETURN_NULL();
+	PG_RETURN_BYTEA_P(result);
 }
 
 /*
- * Read the whole of a file, returning it as bytea
+ * Read a section of a file, returning it as bytea
  */
 Datum
-pg_read_binary_file_all(PG_FUNCTION_ARGS)
+pg_read_binary_file(PG_FUNCTION_ARGS)
+{
+	return read_binary_file_wrapper(fcinfo);
+}
+
+/*
+ * Similar to pg_read_binary_file, with IF NOT EXISTS option.
+ */
+Datum
+pg_read_binary_file_extended(PG_FUNCTION_ARGS)
+{
+	return read_binary_file_wrapper(fcinfo);
+}
+
+/*
+ * Wrapper for pg_read_binary_file_all and pg_read_binary_file_all_extended.
+ */
+static Datum
+read_binary_file_all_wrapper(PG_FUNCTION_ARGS)
 {
 	text	   *filename_t = PG_GETARG_TEXT_P(0);
 	char	   *filename;
+	bool		if_not_exists = false;
+	text	   *result;
 
 	if (!superuser())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 				 (errmsg("must be superuser to read files"))));
 
+	/* Check for option if_not_exists */
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+		if_not_exists = PG_GETARG_BOOL(1);
+
 	filename = convert_and_check_filename(filename_t);
 
-	PG_RETURN_BYTEA_P(read_binary_file(filename, 0, -1));
+	result = read_binary_file(filename, 0, -1, if_not_exists);
+	if (result == NULL)
+		PG_RETURN_NULL();
+
+	PG_RETURN_BYTEA_P(result);
+}
+
+/*
+ * Read the whole of a file, returning it as bytea
+ */
+Datum
+pg_read_binary_file_all(PG_FUNCTION_ARGS)
+{
+	return read_binary_file_all_wrapper(fcinfo);
+}
+
+/*
+ * Similar to pg_read_binary_file_all, with IF NOT EXISTS option.
+ */
+Datum
+pg_read_binary_file_all_extended(PG_FUNCTION_ARGS)
+{
+	return read_binary_file_all_wrapper(fcinfo);
 }
 
 /*
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 8da1781..c805a9a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3189,12 +3189,20 @@ DATA(insert OID = 3307 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0
 DESCR("get information about file");
 DATA(insert OID = 2624 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 25 "25 20 20" _null_ _null_ _null_ _null_ _null_ pg_read_file _null_ _null_ _null_ ));
 DESCR("read text from a file");
+DATA(insert OID = 3293 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 25 "25 20 20 16" _null_ _null_ _null_ _null_ _null_ pg_read_file_extended _null_ _null_ _null_ ));
+DESCR("read text from a file");
 DATA(insert OID = 3826 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 25 "25" _null_ _null_ _null_ _null_ _null_ pg_read_file_all _null_ _null_ _null_ ));
 DESCR("read text from a file");
+DATA(insert OID = 3294 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 25 "25 16" _null_ _null_ _null_ _null_ _null_ pg_read_file_all_extended _null_ _null_ _null_ ));
+DESCR("read text from a file");
 DATA(insert OID = 3827 ( pg_read_binary_file	PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 17 "25 20 20" _null_ _null_ _null_ _null_ _null_ pg_read_binary_file _null_ _null_ _null_ ));
 DESCR("read bytea from a file");
+DATA(insert OID = 3295 ( pg_read_binary_file	PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 17 "25 20 20 16" _null_ _null_ _null_ _null_ _null_ pg_read_binary_file_extended _null_ _null_ _null_ ));
+DESCR("read bytea from a file");
 DATA(insert OID = 3828 ( pg_read_binary_file	PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 17 "25" _null_ _null_ _null_ _null_ _null_ pg_read_binary_file_all _null_ _null_ _null_ ));
 DESCR("read bytea from a file");
+DATA(insert OID = 3296 ( pg_read_binary_file	PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 17 "25 16" _null_ _null_ _null_ _null_ _null_ pg_read_binary_file_all_extended _null_ _null_ _null_ ));
+DESCR("read bytea from a file");
 DATA(insert OID = 2625 ( pg_ls_dir			PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 25 "25" _null_ _null_ _null_ _null_ _null_ pg_ls_dir _null_ _null_ _null_ ));
 DESCR("list all files in a directory");
 DATA(insert OID = 2626 ( pg_sleep			PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "701" _null_ _null_ _null_ _null_ _null_ pg_sleep _null_ _null_ _null_ ));
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index e46650a..82d0e18 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -470,13 +470,19 @@ extern Datum pg_relation_filepath(PG_FUNCTION_ARGS);
 
 /* genfile.c */
 extern bytea *read_binary_file(const char *filename,
-				 int64 seek_offset, int64 bytes_to_read);
+							   int64 seek_offset,
+							   int64 bytes_to_read,
+							   bool if_not_exists);
 extern Datum pg_stat_file(PG_FUNCTION_ARGS);
 extern Datum pg_stat_file_extended(PG_FUNCTION_ARGS);
 extern Datum pg_read_file(PG_FUNCTION_ARGS);
+extern Datum pg_read_file_extended(PG_FUNCTION_ARGS);
 extern Datum pg_read_file_all(PG_FUNCTION_ARGS);
+extern Datum pg_read_file_all_extended(PG_FUNCTION_ARGS);
 extern Datum pg_read_binary_file(PG_FUNCTION_ARGS);
+extern Datum pg_read_binary_file_extended(PG_FUNCTION_ARGS);
 extern Datum pg_read_binary_file_all(PG_FUNCTION_ARGS);
+extern Datum pg_read_binary_file_all_extended(PG_FUNCTION_ARGS);
 extern Datum pg_ls_dir(PG_FUNCTION_ARGS);
 
 /* misc.c */
-- 
2.4.4

0004-Extend-pg_ls_dir-with-include_dot_dirs-and-if_not_ex.patchapplication/x-patch; name=0004-Extend-pg_ls_dir-with-include_dot_dirs-and-if_not_ex.patchDownload
From 3299e47802759d362a034d718492420e3bea62dd Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Wed, 24 Jun 2015 10:59:16 +0900
Subject: [PATCH 4/8] Extend pg_ls_dir with include_dot_dirs and if_not_exists

If include_dot_dirs is set to true, "." and ".." are listed in the list
of files listed. If if_not_exists is true, an empty list of files is
returned to caller instead of erroring out if the folder specified does
not exist.
---
 doc/src/sgml/func.sgml          | 11 +++++--
 src/backend/utils/adt/genfile.c | 65 ++++++++++++++++++++++++++++++++++-------
 src/include/catalog/pg_proc.h   |  2 ++
 src/include/utils/builtins.h    |  1 +
 4 files changed, 66 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index edfea8b..3b5161a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17809,10 +17809,17 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
      <tbody>
       <row>
        <entry>
-        <literal><function>pg_ls_dir(<parameter>dirname</> <type>text</>)</function></literal>
+        <literal><function>pg_ls_dir(<parameter>dirname</> <type>text</> [, <parameter>if_not_exists</> <type>boolean</>, <parameter>include_dot_dirs</> <type>boolean</>] )</function></literal>
        </entry>
        <entry><type>setof text</type></entry>
-       <entry>List the contents of a directory</entry>
+       <entry>
+        List the contents of a directory.  If
+        <parameter>if_not_exists</parameter> is true, an empty result is
+        returned instead of an error. If <parameter>include_dot_dirs</>
+        is true, <quote>.</> and <quote>..</> are included as well in
+        the returned list, which is useful to distinguish an empty
+        and an non-existent directory.
+       </entry>
       </row>
       <row>
        <entry>
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index d45c66e..12866dd 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -35,6 +35,8 @@ typedef struct
 {
 	char	   *location;
 	DIR		   *dirdesc;
+	bool		include_dot_dirs;
+	bool		leave_early;
 } directory_fctx;
 
 
@@ -483,14 +485,15 @@ pg_stat_file(PG_FUNCTION_ARGS)
 
 
 /*
- * List a directory (returns the filenames only)
+ * Wrapper for pg_ls_dir and pg_ls_dir_extended.
  */
-Datum
-pg_ls_dir(PG_FUNCTION_ARGS)
+static Datum
+ls_dir_wrapper(PG_FUNCTION_ARGS)
 {
 	FuncCallContext *funcctx;
 	struct dirent *de;
 	directory_fctx *fctx;
+	MemoryContext	oldcontext;
 
 	if (!superuser())
 		ereport(ERROR,
@@ -499,7 +502,17 @@ pg_ls_dir(PG_FUNCTION_ARGS)
 
 	if (SRF_IS_FIRSTCALL())
 	{
-		MemoryContext oldcontext;
+		bool			if_not_exists = false;
+		bool			include_dot_dirs = false;
+
+		/* see if custom parameters have been passed down */
+		if (PG_NARGS() == 3)
+		{
+			if (!PG_ARGISNULL(1))
+				if_not_exists = PG_GETARG_BOOL(1);
+			if (!PG_ARGISNULL(2))
+				include_dot_dirs = PG_GETARG_BOOL(2);
+		}
 
 		funcctx = SRF_FIRSTCALL_INIT();
 		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
@@ -507,14 +520,20 @@ pg_ls_dir(PG_FUNCTION_ARGS)
 		fctx = palloc(sizeof(directory_fctx));
 		fctx->location = convert_and_check_filename(PG_GETARG_TEXT_P(0));
 
+		fctx->include_dot_dirs = include_dot_dirs;
+		fctx->leave_early = false;
 		fctx->dirdesc = AllocateDir(fctx->location);
 
 		if (!fctx->dirdesc)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not open directory \"%s\": %m",
-							fctx->location)));
-
+		{
+			if (if_not_exists && errno == ENOENT)
+				fctx->leave_early = true;
+			else
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not open directory \"%s\": %m",
+								fctx->location)));
+		}
 		funcctx->user_fctx = fctx;
 		MemoryContextSwitchTo(oldcontext);
 	}
@@ -522,10 +541,15 @@ pg_ls_dir(PG_FUNCTION_ARGS)
 	funcctx = SRF_PERCALL_SETUP();
 	fctx = (directory_fctx *) funcctx->user_fctx;
 
+	/* Leave earlier if needed */
+	if (fctx->leave_early)
+		SRF_RETURN_DONE(funcctx);
+
 	while ((de = ReadDir(fctx->dirdesc, fctx->location)) != NULL)
 	{
-		if (strcmp(de->d_name, ".") == 0 ||
-			strcmp(de->d_name, "..") == 0)
+		if (!fctx->include_dot_dirs &&
+			(strcmp(de->d_name, ".") == 0 ||
+			 strcmp(de->d_name, "..") == 0))
 			continue;
 
 		SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(de->d_name));
@@ -535,3 +559,22 @@ pg_ls_dir(PG_FUNCTION_ARGS)
 
 	SRF_RETURN_DONE(funcctx);
 }
+
+/*
+ * List a directory (returns the filenames only)
+ */
+Datum
+pg_ls_dir(PG_FUNCTION_ARGS)
+{
+	return ls_dir_wrapper(fcinfo);
+}
+
+/*
+ * List a directory (can optionally return dot directories or
+ * use IF NOT EXISTS).
+ */
+Datum
+pg_ls_dir_extended(PG_FUNCTION_ARGS)
+{
+	return ls_dir_wrapper(fcinfo);
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index c805a9a..b559572 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3205,6 +3205,8 @@ DATA(insert OID = 3296 ( pg_read_binary_file	PGNSP PGUID 12 1 0 0 0 f f f f t f
 DESCR("read bytea from a file");
 DATA(insert OID = 2625 ( pg_ls_dir			PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0 25 "25" _null_ _null_ _null_ _null_ _null_ pg_ls_dir _null_ _null_ _null_ ));
 DESCR("list all files in a directory");
+DATA(insert OID = 3297 ( pg_ls_dir			PGNSP PGUID 12 1 1000 0 0 f f f f t t v 3 0 25 "25 16 16" _null_ _null_ _null_ _null_ _null_ pg_ls_dir_extended _null_ _null_ _null_ ));
+DESCR("list all files in a directory");
 DATA(insert OID = 2626 ( pg_sleep			PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "701" _null_ _null_ _null_ _null_ _null_ pg_sleep _null_ _null_ _null_ ));
 DESCR("sleep for the specified time in seconds");
 DATA(insert OID = 3935 ( pg_sleep_for			PGNSP PGUID 14 1 0 0 0 f f f f t f v 1 0 2278 "1186" _null_ _null_ _null_ _null_ _null_ "select pg_catalog.pg_sleep(extract(epoch from pg_catalog.clock_timestamp() operator(pg_catalog.+) $1) operator(pg_catalog.-) extract(epoch from pg_catalog.clock_timestamp()))" _null_ _null_ _null_ ));
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 82d0e18..446cda6 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -484,6 +484,7 @@ extern Datum pg_read_binary_file_extended(PG_FUNCTION_ARGS);
 extern Datum pg_read_binary_file_all(PG_FUNCTION_ARGS);
 extern Datum pg_read_binary_file_all_extended(PG_FUNCTION_ARGS);
 extern Datum pg_ls_dir(PG_FUNCTION_ARGS);
+extern Datum pg_ls_dir_extended(PG_FUNCTION_ARGS);
 
 /* misc.c */
 extern Datum current_database(PG_FUNCTION_ARGS);
-- 
2.4.4

0005-Add-new-column-islink-in-pg_stat_file.patchapplication/x-patch; name=0005-Add-new-column-islink-in-pg_stat_file.patchDownload
From fae83e948c7f8bd9f8f33ef6f088f8d7b5cf465a Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Wed, 24 Jun 2015 11:26:02 +0900
Subject: [PATCH 5/8] Add new column islink in pg_stat_file

This allows to detect if the file path evaluated is a synlink or not.
pg_stat_file is switched to use as well lstat to enable soft link
detection.
---
 doc/src/sgml/func.sgml          |  5 +++--
 src/backend/utils/adt/genfile.c | 15 +++++++++++----
 src/include/catalog/pg_proc.h   |  4 ++--
 3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3b5161a..20b91ab 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17903,8 +17903,9 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
     <function>pg_stat_file</> returns a record containing the file
     size, last accessed time stamp, last modified time stamp,
     last file status change time stamp (Unix platforms only),
-    file creation time stamp (Windows only), and a <type>boolean</type>
-    indicating if it is a directory.  Typical usages include:
+    file creation time stamp (Windows only), a <type>boolean</type>
+    indicating if it is a directory, and a <type>boolean</type>
+    indicating if it is a symbolic link.  Typical usages include:
 <programlisting>
 SELECT * FROM pg_stat_file('filename');
 SELECT (pg_stat_file('filename')).modification;
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 12866dd..456061b 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -392,8 +392,8 @@ stat_file_wrapper(PG_FUNCTION_ARGS)
 	text	   *filename_t = PG_GETARG_TEXT_P(0);
 	char	   *filename;
 	struct stat fst;
-	Datum		values[6];
-	bool		isnull[6];
+	Datum		values[7];
+	bool		isnull[7];
 	HeapTuple	tuple;
 	TupleDesc	tupdesc;
 	bool		if_not_exists = false;
@@ -410,7 +410,7 @@ stat_file_wrapper(PG_FUNCTION_ARGS)
 
 	filename = convert_and_check_filename(filename_t);
 
-	if (stat(filename, &fst) < 0)
+	if (lstat(filename, &fst) < 0)
 	{
 		if (if_not_exists && errno == ENOENT)
 			return_null = true;
@@ -424,7 +424,7 @@ stat_file_wrapper(PG_FUNCTION_ARGS)
 	 * This record type had better match the output parameters declared for me
 	 * in pg_proc.h.
 	 */
-	tupdesc = CreateTemplateTupleDesc(6, false);
+	tupdesc = CreateTemplateTupleDesc(7, false);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 1,
 					   "size", INT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 2,
@@ -437,6 +437,8 @@ stat_file_wrapper(PG_FUNCTION_ARGS)
 					   "creation", TIMESTAMPTZOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 6,
 					   "isdir", BOOLOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7,
+					   "islink", BOOLOID, -1, 0);
 	BlessTupleDesc(tupdesc);
 
 	memset(isnull, return_null, sizeof(isnull));
@@ -455,6 +457,11 @@ stat_file_wrapper(PG_FUNCTION_ARGS)
 		values[4] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_ctime));
 #endif
 		values[5] = BoolGetDatum(S_ISDIR(fst.st_mode));
+#ifndef WIN32
+		values[6] = BoolGetDatum(S_ISLNK(fst.st_mode));
+#else
+		values[6] = BoolGetDatum(pgwin32_is_junction(filename));
+#endif
 	}
 
 	tuple = heap_form_tuple(tupdesc, values, isnull);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index b559572..36d3e15 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3183,9 +3183,9 @@ DESCR("reload configuration files");
 DATA(insert OID = 2622 ( pg_rotate_logfile		PGNSP PGUID 12 1 0 0 0 f f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ _null_ pg_rotate_logfile _null_ _null_ _null_ ));
 DESCR("rotate log file");
 
-DATA(insert OID = 2623 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2249 "25" "{25,20,1184,1184,1184,1184,16}" "{i,o,o,o,o,o,o}" "{filename,size,access,modification,change,creation,isdir}" _null_ _null_ pg_stat_file _null_ _null_ _null_ ));
+DATA(insert OID = 2623 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2249 "25" "{25,20,1184,1184,1184,1184,16,16}" "{i,o,o,o,o,o,o,o}" "{filename,size,access,modification,change,creation,isdir,islink}" _null_ _null_ pg_stat_file _null_ _null_ _null_ ));
 DESCR("get information about file");
-DATA(insert OID = 3307 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2249 "25 16" "{25,16,20,1184,1184,1184,1184,16}" "{i,i,o,o,o,o,o,o}" "{filename,if_not_exists,size,access,modification,change,creation,isdir}" _null_ _null_ pg_stat_file_extended _null_ _null_ _null_ ));
+DATA(insert OID = 3307 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2249 "25 16" "{25,16,20,1184,1184,1184,1184,16,16}" "{i,i,o,o,o,o,o,o,o}" "{filename,if_not_exists,size,access,modification,change,creation,isdir,islink}" _null_ _null_ pg_stat_file_extended _null_ _null_ _null_ ));
 DESCR("get information about file");
 DATA(insert OID = 2624 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 25 "25 20 20" _null_ _null_ _null_ _null_ _null_ pg_read_file _null_ _null_ _null_ ));
 DESCR("read text from a file");
-- 
2.4.4

0006-Add-pg_readlink-to-get-value-of-a-symbolic-link.patchapplication/x-patch; name=0006-Add-pg_readlink-to-get-value-of-a-symbolic-link.patchDownload
From ef78401e6b136577fb1703a6820743ab404631b1 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Wed, 24 Jun 2015 13:03:08 +0900
Subject: [PATCH 6/8] Add pg_readlink to get value of a symbolic link

This is similar to Posix's readlink, except that PostgreSQL restrictions
are applied to the source path queried, similarly to other functions of
genfile.c.
---
 doc/src/sgml/func.sgml          |  9 +++++++
 src/backend/utils/adt/genfile.c | 54 +++++++++++++++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.h   |  2 ++
 src/include/utils/builtins.h    |  1 +
 4 files changed, 66 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 20b91ab..81a8090 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17855,6 +17855,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         an error.
        </entry>
       </row>
+      <row>
+       <entry>
+        <literal><function>pg_readlink(<parameter>filename</> <type>text</>)</function></literal>
+       </entry>
+       <entry><type>text</type></entry>
+       <entry>
+        Return value of a symbolic link.
+       </entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 456061b..94ff7d6 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -585,3 +585,57 @@ pg_ls_dir_extended(PG_FUNCTION_ARGS)
 {
 	return ls_dir_wrapper(fcinfo);
 }
+
+
+/*
+ * pg_readlink - Find the real location of the provided symbolic link.
+ */
+Datum
+pg_readlink(PG_FUNCTION_ARGS)
+{
+	text	   *sourcepath_t = PG_GETARG_TEXT_P(0);
+	char	   *sourcepath;
+	char        targetpath[MAXPGPATH];
+	int			rllen;
+	bool		if_not_exists = false;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to read files"))));
+
+	/* Check for option if_not_exists */
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+		if_not_exists = PG_GETARG_BOOL(1);
+
+#if defined(HAVE_READLINK) || defined(WIN32)
+	sourcepath = convert_and_check_filename(sourcepath_t);
+
+	rllen = readlink(sourcepath, targetpath, sizeof(targetpath));
+	if (rllen < 0)
+	{
+		if (if_not_exists && errno == ENOENT)
+			PG_RETURN_NULL();
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read symbolic link \"%s\": %m",
+							sourcepath)));
+	}
+	if (rllen >= sizeof(targetpath))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("symbolic link \"%s\" target is too long",
+						sourcepath)));
+	}
+	targetpath[rllen] = '\0';
+
+	PG_RETURN_TEXT_P(cstring_to_text(targetpath));
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("readline is not available on this platform")));
+	PG_RETURN_NULL();
+#endif
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 36d3e15..11dd80d 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3207,6 +3207,8 @@ DATA(insert OID = 2625 ( pg_ls_dir			PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0
 DESCR("list all files in a directory");
 DATA(insert OID = 3297 ( pg_ls_dir			PGNSP PGUID 12 1 1000 0 0 f f f f t t v 3 0 25 "25 16 16" _null_ _null_ _null_ _null_ _null_ pg_ls_dir_extended _null_ _null_ _null_ ));
 DESCR("list all files in a directory");
+DATA(insert OID = 3298 ( pg_readlink		PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 25 "25" _null_ _null_ _null_ _null_ _null_ pg_readlink _null_ _null_ _null_ ));
+DESCR("read value of a symbolic link");
 DATA(insert OID = 2626 ( pg_sleep			PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "701" _null_ _null_ _null_ _null_ _null_ pg_sleep _null_ _null_ _null_ ));
 DESCR("sleep for the specified time in seconds");
 DATA(insert OID = 3935 ( pg_sleep_for			PGNSP PGUID 14 1 0 0 0 f f f f t f v 1 0 2278 "1186" _null_ _null_ _null_ _null_ _null_ "select pg_catalog.pg_sleep(extract(epoch from pg_catalog.clock_timestamp() operator(pg_catalog.+) $1) operator(pg_catalog.-) extract(epoch from pg_catalog.clock_timestamp()))" _null_ _null_ _null_ ));
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 446cda6..27a4fd8 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -485,6 +485,7 @@ extern Datum pg_read_binary_file_all(PG_FUNCTION_ARGS);
 extern Datum pg_read_binary_file_all_extended(PG_FUNCTION_ARGS);
 extern Datum pg_ls_dir(PG_FUNCTION_ARGS);
 extern Datum pg_ls_dir_extended(PG_FUNCTION_ARGS);
+extern Datum pg_readlink(PG_FUNCTION_ARGS);
 
 /* misc.c */
 extern Datum current_database(PG_FUNCTION_ARGS);
-- 
2.4.4

0007-Add-regression-tests-for-pg_ls_dir-and-pg_read_-bina.patchapplication/x-patch; name=0007-Add-regression-tests-for-pg_ls_dir-and-pg_read_-bina.patchDownload
From e39f8148f2baa95a6426f46f407eeb0fa243a478 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Tue, 23 Jun 2015 10:42:48 +0900
Subject: [PATCH 7/8] Add regression tests for pg_ls_dir and
 pg_read_[binary_]file

With the new options if_not_exists that have been added, this looks
worth adding.
---
 src/test/regress/expected/file.out | 84 ++++++++++++++++++++++++++++++++++++++
 src/test/regress/parallel_schedule |  3 ++
 src/test/regress/sql/file.sql      | 29 +++++++++++++
 3 files changed, 116 insertions(+)
 create mode 100644 src/test/regress/expected/file.out
 create mode 100644 src/test/regress/sql/file.sql

diff --git a/src/test/regress/expected/file.out b/src/test/regress/expected/file.out
new file mode 100644
index 0000000..0ddae30
--- /dev/null
+++ b/src/test/regress/expected/file.out
@@ -0,0 +1,84 @@
+--
+-- FILE
+--
+-- Tests of SQL functions interacting directly with files of PGDATA
+-- pg_read_file
+SELECT pg_read_file('postgresql.auto.conf', 0, 23); -- OK
+      pg_read_file       
+-------------------------
+ # Do not edit this file
+(1 row)
+
+SELECT pg_read_file('postgresql_not_exists', 0, 23); -- not OK
+ERROR:  could not open file "postgresql_not_exists" for reading: No such file or directory
+SELECT pg_read_file('postgresql_not_exists', 0, 23, false); -- not OK
+ERROR:  could not open file "postgresql_not_exists" for reading: No such file or directory
+SELECT pg_read_file('postgresql_not_exists', 0, 23, true); -- OK
+ pg_read_file 
+--------------
+ 
+(1 row)
+
+-- pg_read_binary_file
+SELECT pg_read_binary_file('postgresql.auto.conf', 0, 23); -- OK
+               pg_read_binary_file                
+--------------------------------------------------
+ \x2320446f206e6f74206564697420746869732066696c65
+(1 row)
+
+SELECT pg_read_binary_file('postgresql_not_exists', 0, 23); -- not OK
+ERROR:  could not open file "postgresql_not_exists" for reading: No such file or directory
+SELECT pg_read_binary_file('postgresql_not_exists', 0, 23, false); -- not OK
+ERROR:  could not open file "postgresql_not_exists" for reading: No such file or directory
+SELECT pg_read_binary_file('postgresql_not_exists', 0, 23, true); -- OK
+ pg_read_binary_file 
+---------------------
+ 
+(1 row)
+
+-- pg_ls_dir
+SELECT pg_ls_dir('postgresql.conf'); -- not OK, not a directory
+ERROR:  could not open directory "postgresql.conf": Not a directory
+SELECT pg_ls_dir('pg_twophase'); -- empty directory
+ pg_ls_dir 
+-----------
+(0 rows)
+
+SELECT pg_ls_dir('pg_twophase', true, true); -- OK, not empty list
+ pg_ls_dir 
+-----------
+ .
+ ..
+(2 rows)
+
+SELECT pg_ls_dir('pg_twophase', false, false); -- OK, empty list
+ pg_ls_dir 
+-----------
+(0 rows)
+
+SELECT pg_ls_dir('pg_twophase', false, true); -- OK, not empty list
+ pg_ls_dir 
+-----------
+ .
+ ..
+(2 rows)
+
+SELECT pg_ls_dir('pg_twophase', true, false); -- OK, empty list
+ pg_ls_dir 
+-----------
+(0 rows)
+
+SELECT pg_ls_dir('postgresql_not_exists', false, true); -- not OK
+ERROR:  could not open directory "postgresql_not_exists": No such file or directory
+SELECT pg_ls_dir('postgresql_not_exists', false, false); -- not OK
+ERROR:  could not open directory "postgresql_not_exists": No such file or directory
+SELECT pg_ls_dir('postgresql_not_exists', true, true); -- OK, empty list
+ pg_ls_dir 
+-----------
+(0 rows)
+
+SELECT pg_ls_dir('postgresql_not_exists', true, false); -- OK, empty list
+ pg_ls_dir 
+-----------
+(0 rows)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 91780cd..b1739fa 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -108,5 +108,8 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
 # event triggers cannot run concurrently with any test that runs DDL
 test: event_trigger
 
+# Files read during this test may be modified in parallel by other tests
+test: file
+
 # run stats by itself because its delay may be insufficient under heavy load
 test: stats
diff --git a/src/test/regress/sql/file.sql b/src/test/regress/sql/file.sql
new file mode 100644
index 0000000..a0238f2
--- /dev/null
+++ b/src/test/regress/sql/file.sql
@@ -0,0 +1,29 @@
+--
+-- FILE
+--
+
+-- Tests of SQL functions interacting directly with files of PGDATA
+
+-- pg_read_file
+SELECT pg_read_file('postgresql.auto.conf', 0, 23); -- OK
+SELECT pg_read_file('postgresql_not_exists', 0, 23); -- not OK
+SELECT pg_read_file('postgresql_not_exists', 0, 23, false); -- not OK
+SELECT pg_read_file('postgresql_not_exists', 0, 23, true); -- OK
+
+-- pg_read_binary_file
+SELECT pg_read_binary_file('postgresql.auto.conf', 0, 23); -- OK
+SELECT pg_read_binary_file('postgresql_not_exists', 0, 23); -- not OK
+SELECT pg_read_binary_file('postgresql_not_exists', 0, 23, false); -- not OK
+SELECT pg_read_binary_file('postgresql_not_exists', 0, 23, true); -- OK
+
+-- pg_ls_dir
+SELECT pg_ls_dir('postgresql.conf'); -- not OK, not a directory
+SELECT pg_ls_dir('pg_twophase'); -- empty directory
+SELECT pg_ls_dir('pg_twophase', true, true); -- OK, not empty list
+SELECT pg_ls_dir('pg_twophase', false, false); -- OK, empty list
+SELECT pg_ls_dir('pg_twophase', false, true); -- OK, not empty list
+SELECT pg_ls_dir('pg_twophase', true, false); -- OK, empty list
+SELECT pg_ls_dir('postgresql_not_exists', false, true); -- not OK
+SELECT pg_ls_dir('postgresql_not_exists', false, false); -- not OK
+SELECT pg_ls_dir('postgresql_not_exists', true, true); -- OK, empty list
+SELECT pg_ls_dir('postgresql_not_exists', true, false); -- OK, empty list
-- 
2.4.4

0008-Fix-symlink-usage-in-pg_rewind.patchapplication/x-patch; name=0008-Fix-symlink-usage-in-pg_rewind.patchDownload
From 370cc091048a2259e633ca65473ed1a835e13451 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Wed, 24 Jun 2015 15:10:16 +0900
Subject: [PATCH 8/8] Fix symlink usage in pg_rewind

Make use of pg_readlink to determine where a symlink in a source PGDATA
streamed is from. Also remove restrictions related to symlink comparisons
on source and target instances, those need to be only limited to
tablespaces.
---
 src/bin/pg_rewind/file_ops.c    |  2 ++
 src/bin/pg_rewind/filemap.c     | 11 ++-----
 src/bin/pg_rewind/filemap.h     |  1 +
 src/bin/pg_rewind/libpq_fetch.c | 63 +++++++++++++++++++++++++----------------
 4 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index c2d8aa1..29ddb1d 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -139,6 +139,7 @@ remove_target(file_entry_t *entry)
 			remove_target_file(entry->path);
 			break;
 
+		case FILE_TYPE_TABLESPACE:
 		case FILE_TYPE_SYMLINK:
 			remove_target_symlink(entry->path);
 			break;
@@ -156,6 +157,7 @@ create_target(file_entry_t *entry)
 			create_target_dir(entry->path);
 			break;
 
+		case FILE_TYPE_TABLESPACE:
 		case FILE_TYPE_SYMLINK:
 			create_target_symlink(entry->path, entry->link_target);
 			break;
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 05eff68..6ef2342 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -112,12 +112,6 @@ process_source_file(const char *path, file_type_t type, size_t newsize,
 	switch (type)
 	{
 		case FILE_TYPE_DIRECTORY:
-			if (exists && !S_ISDIR(statbuf.st_mode))
-			{
-				/* it's a directory in source, but not in target. Strange.. */
-				pg_fatal("\"%s\" is not a directory\n", localpath);
-			}
-
 			if (!exists)
 				action = FILE_ACTION_CREATE;
 			else
@@ -125,7 +119,7 @@ process_source_file(const char *path, file_type_t type, size_t newsize,
 			oldsize = 0;
 			break;
 
-		case FILE_TYPE_SYMLINK:
+		case FILE_TYPE_TABLESPACE:
 			if (exists &&
 #ifndef WIN32
 				!S_ISLNK(statbuf.st_mode)
@@ -140,7 +134,8 @@ process_source_file(const char *path, file_type_t type, size_t newsize,
 				 */
 				pg_fatal("\"%s\" is not a symbolic link\n", localpath);
 			}
-
+			/* fallback to default */
+		case FILE_TYPE_SYMLINK:
 			if (!exists)
 				action = FILE_ACTION_CREATE;
 			else
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index 9943ec5..35540d6 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -36,6 +36,7 @@ typedef enum
 {
 	FILE_TYPE_REGULAR,
 	FILE_TYPE_DIRECTORY,
+	FILE_TYPE_TABLESPACE,
 	FILE_TYPE_SYMLINK
 } file_type_t;
 
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index df71069..49950b9 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -140,30 +140,28 @@ libpqProcessFileList(void)
 	 * Create a recursive directory listing of the whole data directory.
 	 *
 	 * The WITH RECURSIVE part does most of the work. The second part gets the
-	 * targets of the symlinks in pg_tblspc directory.
-	 *
-	 * XXX: There is no backend function to get a symbolic link's target in
-	 * general, so if the admin has put any custom symbolic links in the data
-	 * directory, they won't be copied correctly.
+	 * targets of the symlinks in the data directory.
 	 */
 	sql =
 		"WITH RECURSIVE files (path, filename, size, isdir) AS (\n"
-		"  SELECT '' AS path, filename, size, isdir FROM\n"
-		"  (SELECT pg_ls_dir('.') AS filename) AS fn,\n"
-		"        pg_stat_file(fn.filename) AS this\n"
+		"  SELECT '' AS path, filename, size, isdir, islink FROM\n"
+		"  (SELECT pg_ls_dir('.', true, false) AS filename) AS fn,\n"
+		"        pg_stat_file(fn.filename, true) AS this\n"
+		"    WHERE this.size IS NOT NULL\n"
 		"  UNION ALL\n"
 		"  SELECT parent.path || parent.filename || '/' AS path,\n"
-		"         fn, this.size, this.isdir\n"
+		"         fn, this.size, this.isdir, this.islink\n"
 		"  FROM files AS parent,\n"
-		"       pg_ls_dir(parent.path || parent.filename) AS fn,\n"
-		"       pg_stat_file(parent.path || parent.filename || '/' || fn) AS this\n"
-		"       WHERE parent.isdir = 't'\n"
-		")\n"
-		"SELECT path || filename, size, isdir,\n"
-		"       pg_tablespace_location(pg_tablespace.oid) AS link_target\n"
-		"FROM files\n"
-		"LEFT OUTER JOIN pg_tablespace ON files.path = 'pg_tblspc/'\n"
-		"                             AND oid::text = files.filename\n";
+		"       pg_ls_dir(parent.path || parent.filename, true, false) AS fn,\n"
+		"       pg_stat_file(parent.path || parent.filename || '/' || fn, true)\n"
+		"              AS this\n"
+		"       WHERE parent.isdir = 't' AND\n"
+		"	          this.size IS NOT NULL)\n"
+		"SELECT path || filename AS file_path, size, isdir, islink,\n"
+		"       path = 'pg_tblspc/' AS istblspc,\n"
+		"       CASE WHEN islink THEN pg_readlink(path || filename)\n"
+		"            ELSE NULL END AS link_target\n"
+		"FROM files\n";
 	res = PQexec(conn, sql);
 
 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
@@ -171,7 +169,7 @@ libpqProcessFileList(void)
 				 PQresultErrorMessage(res));
 
 	/* sanity check the result set */
-	if (PQnfields(res) != 4)
+	if (PQnfields(res) != 6)
 		pg_fatal("unexpected result set while fetching file list\n");
 
 	/* Read result to local variables */
@@ -180,10 +178,14 @@ libpqProcessFileList(void)
 		char	   *path = PQgetvalue(res, i, 0);
 		int			filesize = atoi(PQgetvalue(res, i, 1));
 		bool		isdir = (strcmp(PQgetvalue(res, i, 2), "t") == 0);
-		char	   *link_target = PQgetvalue(res, i, 3);
+		bool		islink = (strcmp(PQgetvalue(res, i, 3), "t") == 0);
+		bool		istblspc = (strcmp(PQgetvalue(res, i, 4), "t") == 0);
+		char	   *link_target = PQgetvalue(res, i, 5);
 		file_type_t type;
 
-		if (link_target[0])
+		if (istblspc)
+			type = FILE_TYPE_TABLESPACE;
+		else if (islink)
 			type = FILE_TYPE_SYMLINK;
 		else if (isdir)
 			type = FILE_TYPE_DIRECTORY;
@@ -259,8 +261,7 @@ receiveFileChunks(const char *sql)
 		}
 
 		if (PQgetisnull(res, 0, 0) ||
-			PQgetisnull(res, 0, 1) ||
-			PQgetisnull(res, 0, 2))
+			PQgetisnull(res, 0, 1))
 		{
 			pg_fatal("unexpected null values in result while fetching remote files\n");
 		}
@@ -278,6 +279,20 @@ receiveFileChunks(const char *sql)
 		memcpy(filename, PQgetvalue(res, 0, 0), filenamelen);
 		filename[filenamelen] = '\0';
 
+		/*
+		 * It may be possible that a file has been deleted on remote side after
+		 * creating the file map. In this case simply ignore it and move on.
+		 */
+		if (PQgetisnull(res, 0, 2))
+		{
+			pg_log(PG_DEBUG,
+				   "received NULL chunk for file \"%s\", file has been deleted\n",
+				   filename);
+			pg_free(filename);
+			PQclear(res);
+			continue;
+		}
+
 		chunk = PQgetvalue(res, 0, 2);
 
 		pg_log(PG_DEBUG, "received chunk for file \"%s\", offset %d, size %d\n",
@@ -445,7 +460,7 @@ libpq_executeFileMap(filemap_t *map)
 	 */
 	sql =
 		"SELECT path, begin, \n"
-		"  pg_read_binary_file(path, begin, len) AS chunk\n"
+		"  pg_read_binary_file(path, begin, len, true) AS chunk\n"
 		"FROM fetchchunks\n";
 
 	receiveFileChunks(sql);
-- 
2.4.4

#21Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#20)
1 attachment(s)
Re: pg_rewind failure by file deletion in source server

On Wed, Jun 24, 2015 at 3:43 PM, Michael Paquier wrote:

I just realized another problem: We recently learned the hard way that some
people have files in the data directory that are not writeable by the
'postgres' user
(/messages/by-id/20150523172627.GA24277@msg.df7cb.de).
pg_rewind will try to overwrite all files it doesn't recognize as relation
files, so it's going to fail on those. A straightforward fix would be to
first open the destination file in read-only mode, and compare its contents,
and only open the file in write mode if it has changed. It would still fail
when the files really differ, but I think that's acceptable.

If I am missing nothing, two code paths need to be patched here:
copy_file_range and receiveFileChunks. copy_file_range is
straight-forward. Now wouldn't it be better to write the contents into
a temporary file, compare their content, and then switch if necessary
for receiveFileChunks?

After sleeping on it, I have been looking at this issue again and came
up with the patch attached. Instead of checking if the content of the
target and the source file are the same, meaning that we would still
need to fetch chunk content from the server in stream mode, I think
that it is more robust to check if the target file can be opened and
check for EACCES on failure, bypassing it if process does not have
permissions on it. the patch contains a test case as well, and is
independent on the rest sent upthread.
Thoughts?
--
Michael

Attachments:

20150625_pgrewind_readonly_error.patchtext/x-patch; charset=US-ASCII; name=20150625_pgrewind_readonly_error.patchDownload
diff --git a/src/bin/pg_rewind/copy_fetch.c b/src/bin/pg_rewind/copy_fetch.c
index 224fad1..ef830ac 100644
--- a/src/bin/pg_rewind/copy_fetch.c
+++ b/src/bin/pg_rewind/copy_fetch.c
@@ -174,7 +174,8 @@ copy_file_range(const char *path, off_t begin, off_t end, bool trunc)
 	if (lseek(srcfd, begin, SEEK_SET) == -1)
 		pg_fatal("could not seek in source file: %s\n", strerror(errno));
 
-	open_target_file(path, trunc);
+	if (!open_target_file(path, trunc))
+		return;
 
 	while (end - begin > 0)
 	{
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index c2d8aa1..1f68855 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -40,17 +40,17 @@ static void remove_target_symlink(const char *path);
  * Open a target file for writing. If 'trunc' is true and the file already
  * exists, it will be truncated.
  */
-void
+bool
 open_target_file(const char *path, bool trunc)
 {
 	int			mode;
 
 	if (dry_run)
-		return;
+		return true;
 
 	if (dstfd != -1 && !trunc &&
 		strcmp(path, &dstpath[strlen(datadir_target) + 1]) == 0)
-		return;					/* already open */
+		return true;				/* already open */
 
 	close_target_file();
 
@@ -60,9 +60,19 @@ open_target_file(const char *path, bool trunc)
 	if (trunc)
 		mode |= O_TRUNC;
 	dstfd = open(dstpath, mode, 0600);
+
+	/* ignore errors for unreadable files */
+	if (dstfd < 0 && errno == EACCES)
+	{
+		pg_log(PG_PROGRESS, "could not open target file \"%s\", bypassing: %s\n",
+			   dstpath, strerror(errno));
+		return false;
+	}
 	if (dstfd < 0)
 		pg_fatal("could not open target file \"%s\": %s\n",
 				 dstpath, strerror(errno));
+
+	return true;
 }
 
 /*
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index f68c71d..e437cb1 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -12,7 +12,7 @@
 
 #include "filemap.h"
 
-extern void open_target_file(const char *path, bool trunc);
+extern bool open_target_file(const char *path, bool trunc);
 extern void write_target_range(char *buf, off_t begin, size_t size);
 extern void close_target_file(void);
 extern void truncate_target_file(const char *path, off_t newsize);
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index df71069..41c096b 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -283,7 +283,8 @@ receiveFileChunks(const char *sql)
 		pg_log(PG_DEBUG, "received chunk for file \"%s\", offset %d, size %d\n",
 			   filename, chunkoff, chunksize);
 
-		open_target_file(filename, false);
+		if (!open_target_file(filename, false))
+			continue;
 
 		write_target_range(chunk, chunkoff, chunksize);
 
@@ -406,8 +407,8 @@ libpq_executeFileMap(filemap_t *map)
 
 			case FILE_ACTION_COPY:
 				/* Truncate the old file out of the way, if any */
-				open_target_file(entry->path, true);
-				fetch_file_range(entry->path, 0, entry->newsize);
+				if (open_target_file(entry->path, true))
+					fetch_file_range(entry->path, 0, entry->newsize);
 				break;
 
 			case FILE_ACTION_TRUNCATE:
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 032301f..77dddbe 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -496,7 +496,8 @@ createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr checkpo
 		pg_fatal("backup label buffer too small\n");	/* shouldn't happen */
 
 	/* TODO: move old file out of the way, if any. */
-	open_target_file("backup_label", true);		/* BACKUP_LABEL_FILE */
+	if (!open_target_file("backup_label", true))		/* BACKUP_LABEL_FILE */
+		pg_fatal("failed update of backup_label\n");
 	write_target_range(buf, 0, len);
 }
 
@@ -557,7 +558,8 @@ updateControlFile(ControlFileData *ControlFile)
 	memset(buffer, 0, PG_CONTROL_SIZE);
 	memcpy(buffer, ControlFile, sizeof(ControlFileData));
 
-	open_target_file("global/pg_control", false);
+	if (!open_target_file("global/pg_control", false))
+		pg_fatal("failed update of global/pg_control");
 
 	write_target_range(buffer, 0, PG_CONTROL_SIZE);
 
diff --git a/src/bin/pg_rewind/t/003_extrafiles.pl b/src/bin/pg_rewind/t/003_extrafiles.pl
index 9a95268..95b42de 100644
--- a/src/bin/pg_rewind/t/003_extrafiles.pl
+++ b/src/bin/pg_rewind/t/003_extrafiles.pl
@@ -51,6 +51,17 @@ sub run_test
 	  "$test_master_datadir/tst_master_dir/master_subdir/master_file3",
 	  "in master3";
 
+	# Create a read-only file on both nodes
+	append_to_file
+	  "$test_standby_datadir/read_only_file",
+	  "read-only file in both4";
+	chmod 0400, "$test_standby_datadir/read_only_file";
+	append_to_file
+	  "$test_master_datadir/read_only_file",
+	  "read-only file in both4";
+	chmod 0400, "$test_standby_datadir/read_only_file";
+	chmod 0400, "$test_master_datadir/read_only_file";
+
 	RewindTest::promote_standby();
 	RewindTest::run_pg_rewind($test_mode);
 
#22Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#21)
Re: pg_rewind failure by file deletion in source server

On Wed, Jun 24, 2015 at 9:41 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Jun 24, 2015 at 3:43 PM, Michael Paquier wrote:

I just realized another problem: We recently learned the hard way that some
people have files in the data directory that are not writeable by the
'postgres' user
(/messages/by-id/20150523172627.GA24277@msg.df7cb.de).
pg_rewind will try to overwrite all files it doesn't recognize as relation
files, so it's going to fail on those. A straightforward fix would be to
first open the destination file in read-only mode, and compare its contents,
and only open the file in write mode if it has changed. It would still fail
when the files really differ, but I think that's acceptable.

If I am missing nothing, two code paths need to be patched here:
copy_file_range and receiveFileChunks. copy_file_range is
straight-forward. Now wouldn't it be better to write the contents into
a temporary file, compare their content, and then switch if necessary
for receiveFileChunks?

After sleeping on it, I have been looking at this issue again and came
up with the patch attached. Instead of checking if the content of the
target and the source file are the same, meaning that we would still
need to fetch chunk content from the server in stream mode, I think
that it is more robust to check if the target file can be opened and
check for EACCES on failure, bypassing it if process does not have
permissions on it. the patch contains a test case as well, and is
independent on the rest sent upthread.
Thoughts?

That seems scary as heck to me. Suppose that you run pg_rewind and
SELinux decides, due to some labeling problem, to deny access to some
files. So we just skip those. Then the user tries to start the
server and maybe it works (since the postgres executable is labeled
differently) or maybe it fails and the user runs restorecon and then
it works. Kaboom, your database is corrupt.

I realize that the recent fsync fiasco demonstrated that people keep
files not readable by PG in the data directory, but that still seems
loopy to me. I realize that we can't de-support that in the back
branches because people have existing configurations that we can't
just break. But maybe we should say, look, that's just not compatible
with pg_rewind, because where the integrity of your data is concerned
"oops, I can't read this, that's probably OK" just does not seem good
enough.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#22)
Re: pg_rewind failure by file deletion in source server

On 2015-06-26 15:07:59 -0400, Robert Haas wrote:

I realize that the recent fsync fiasco demonstrated that people keep
files not readable by PG in the data directory

It wasn't unreadable files that were the primary problem, it was files
with read only permissions, no?

"oops, I can't read this, that's probably OK" just does not seem good
enough.

Agreed.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#23)
Re: pg_rewind failure by file deletion in source server

On Fri, Jun 26, 2015 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-06-26 15:07:59 -0400, Robert Haas wrote:

I realize that the recent fsync fiasco demonstrated that people keep
files not readable by PG in the data directory

It wasn't unreadable files that were the primary problem, it was files
with read only permissions, no?

Yes, that's true.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#20)
Re: pg_rewind failure by file deletion in source server

On 06/24/2015 09:43 AM, Michael Paquier wrote:

Attached is a new set of patches. Except for the last ones that
addresses one issue of pg_rewind (symlink management when streaming
PGDATA), all the others introduce if_not_exists options for the
functions of genfile.c. The pg_rewind stuff could be more polished
though. Feel free to comment.

I've committed the additional option to the functions in genfile.c (I
renamed it to "missing_ok", for clarity), and the pg_rewind changes to
use that option.

I ended up refactoring the patch quite a bit, so if you could
double-check what I committed to make sure I didn't break anything, that
would be great.

I didn't commit the tablespace or symlink handling changes yet, will
review those separately.

I also didn't commit the new regression test yet. It would indeed be
nice to have one, but I think it was a few bricks shy of a load. It
should work in a freshly initdb'd system, but not necessarily on an
existing installation. First, it relied on the fact that
postgresql.conf.auto exists, but a DBA might remove that if he wants to
make sure the feature is not used. Secondly, it relied on the fact that
pg_twophase is empty, but there is no guarantee of that either. Third,
the error messages included in the expected output, e.g "No such file or
directory", depend on the operating system and locale. And finally, it'd
be nice to test more things, in particular the behaviour of different
offsets and lengths to pg_read_binary_file(), although an incomplete
test would be better than no test at all.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#8)
Re: pg_rewind failure by file deletion in source server

On 06/26/2015 10:10 PM, Andres Freund wrote:

On 2015-06-26 15:07:59 -0400, Robert Haas wrote:

I realize that the recent fsync fiasco demonstrated that people keep
files not readable by PG in the data directory

It wasn't unreadable files that were the primary problem, it was files
with read only permissions, no?

Right.

"oops, I can't read this, that's probably OK" just does not seem good
enough.

Agreed.

After thinking about this some more, I think it'd be acceptable if we
just fail, if there are any non-writeable files in the data directory.
The typical scenario is that postgresql.conf, or an SSL cert file, is a
symlink to outside the data directory. It seems reasonable to require
that the DBA just removes the symlink before running pg_rewind, and
restores it afterwards if appropriate. In many cases, you would *not*
want to overwrite your config files, SSL cert files, etc., so the DBA
will need to script backing up and restoring those anyway.

(It's a fair question whether pg_rewind should be copying those files in
the first place. I've opted for "yes", so that it's easy to explain the
behaviour of pg_rewind: the end result is the same as if you took a new
base backup from the source server. Whatever files you want to backup up
before you re-initialize from a base backup, you should also backup with
pg_rewind.)

But we'll still need to handle the pg_xlog symlink case somehow. Perhaps
it would be enough to special-case pg_xlog for now.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#27Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#25)
5 attachment(s)
Re: pg_rewind failure by file deletion in source server

On Mon, Jun 29, 2015 at 3:46 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/24/2015 09:43 AM, Michael Paquier wrote:

Attached is a new set of patches. Except for the last ones that
addresses one issue of pg_rewind (symlink management when streaming
PGDATA), all the others introduce if_not_exists options for the
functions of genfile.c. The pg_rewind stuff could be more polished
though. Feel free to comment.

I've committed the additional option to the functions in genfile.c (I
renamed it to "missing_ok", for clarity), and the pg_rewind changes to use
that option.

I ended up refactoring the patch quite a bit, so if you could double-check
what I committed to make sure I didn't break anything, that would be great.

Thanks, the new patch looks far better than what I did, I noticed a
couple of typos in the docs though:
- s/behaviour/behavior, "behavior" is American English spelling, and
it is the one used elsewhere as well, hence I guess that it makes
sense to our it.
- s/an non-existent/a non-existent
- pg_proc.h is still using if_not_exists as in my patch (you corrected
it to use missing_ok).
Those are fixed as 0001 in the attached set.

I didn't commit the tablespace or symlink handling changes yet, will review
those separately.

Thanks. Attached are rebased versions that take into account your
previous changes as well (like the variable renaming, wrapper function
usage, etc). I also added missing_ok to pg_readlink for consistency,
and rebased the fix of pg_rewind for soft links with 0005. Note that
0005 does not use pg_tablespace_location(), still the patch series
include an implementation of missing_ok for it. Feel free to use it if
necessary.

I also didn't commit the new regression test yet. It would indeed be nice to
have one, but I think it was a few bricks shy of a load. It should work in a
freshly initdb'd system, but not necessarily on an existing installation.
First, it relied on the fact that postgresql.conf.auto exists, but a DBA
might remove that if he wants to make sure the feature is not used.
Secondly, it relied on the fact that pg_twophase is empty, but there is no
guarantee of that either. Third, the error messages included in the expected
output, e.g "No such file or directory", depend on the operating system and
locale. And finally, it'd be nice to test more things, in particular the
behaviour of different offsets and lengths to pg_read_binary_file(),
although an incomplete test would be better than no test at all.

Portability is going to really reduce the test range, the only things
that we could test are:
- NULL results returned when missing_ok = true (with a dummy file
name/link/directory) as missing_ok = false would choke depending on
the platform as you mentioned.
- Ensure that those functions called by users without superuser rights
fail properly.
- Path format errors for each function, like that:
=# select pg_ls_dir('..');
ERROR: 42501: path must be in or below the current directory
LOCATION: convert_and_check_filename, genfile.c:78
For tests on pg_read_binary_file, do you think that there is one file
of PGDATA that we could use for scanning? I cannot think of one on the
top of my mind now (postgresql.conf or any configuration files could
be overridden by the user so they are out of scope, PG_VERSION is an
additional maintenance burden). Well, I think that those things are
still worth testing in any case and I think that you think so as well.
If you are fine with that I could wrap up a patch that's better than
nothing done for sure. Thoughts?

Now we could have a TAP test for this stuff, where we could have a
custom PGDATA with some dummy files that we know will be empty for
example, still it seems like an overkill to me for this purpose,
initdb is rather costly in this facility.. And on small machines.
Regards,
--
Michael

Attachments:

0001-Fix-some-typos-and-an-incorrect-naming-introduced-by.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-some-typos-and-an-incorrect-naming-introduced-by.patchDownload
From 378bd8af5c27e7e20b038f05b693b95e9871ef0b Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Mon, 29 Jun 2015 13:38:03 +0900
Subject: [PATCH 1/5] Fix some typos and an incorrect naming introduced by
 cb2acb1

---
 doc/src/sgml/func.sgml        | 4 ++--
 src/include/catalog/pg_proc.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index d49cd43..8b28e29 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17851,7 +17851,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
 
    <para>
     All of these functions take an optional <parameter>missing_ok</> parameter,
-    which specifies the behaviour when the file or directory does not exist.
+    which specifies the behavior when the file or directory does not exist.
     If <literal>true</literal>, the function returns NULL (except
     <function>pg_ls_dir</>, which returns an empty result set). If
     <literal>false</>, an error is raised. The default is <literal>false</>.
@@ -17867,7 +17867,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
     included in the result set. The default is to exclude them
     (<literal>false</>), but including them can be useful when
     <parameter>missing_ok</> is <literal>true</literal>, to distinguish an
-    empty directory from an non-existent directory.
+    empty directory from a non-existent directory.
    </para>
 
    <indexterm>
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 3a40fa6..be3a8fb 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3183,7 +3183,7 @@ DESCR("rotate log file");
 
 DATA(insert OID = 2623 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2249 "25" "{25,20,1184,1184,1184,1184,16}" "{i,o,o,o,o,o,o}" "{filename,size,access,modification,change,creation,isdir}" _null_ _null_ pg_stat_file_1arg _null_ _null_ _null_ ));
 DESCR("get information about file");
-DATA(insert OID = 3307 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2249 "25 16" "{25,16,20,1184,1184,1184,1184,16}" "{i,i,o,o,o,o,o,o}" "{filename,if_not_exists,size,access,modification,change,creation,isdir}" _null_ _null_ pg_stat_file _null_ _null_ _null_ ));
+DATA(insert OID = 3307 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2249 "25 16" "{25,16,20,1184,1184,1184,1184,16}" "{i,i,o,o,o,o,o,o}" "{filename,missing_ok,size,access,modification,change,creation,isdir}" _null_ _null_ pg_stat_file _null_ _null_ _null_ ));
 DESCR("get information about file");
 DATA(insert OID = 2624 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 25 "25 20 20" _null_ _null_ _null_ _null_ _null_ pg_read_file_off_len _null_ _null_ _null_ ));
 DESCR("read text from a file");
-- 
2.4.5

0002-Add-new-column-islink-in-pg_stat_file.patchtext/x-patch; charset=US-ASCII; name=0002-Add-new-column-islink-in-pg_stat_file.patchDownload
From 9ae79930461a9024d01a878bdf63bfc4cd143c0d Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Wed, 24 Jun 2015 11:26:02 +0900
Subject: [PATCH 2/5] Add new column islink in pg_stat_file

This allows to detect if the file path evaluated is a symlink or not.
pg_stat_file is switched to use as well lstat to enable soft link
detection.
---
 doc/src/sgml/func.sgml          |  5 +++--
 src/backend/utils/adt/genfile.c | 15 +++++++++++----
 src/include/catalog/pg_proc.h   |  5 ++---
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 8b28e29..67e1626 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17905,8 +17905,9 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
     <function>pg_stat_file</> returns a record containing the file
     size, last accessed time stamp, last modified time stamp,
     last file status change time stamp (Unix platforms only),
-    file creation time stamp (Windows only), and a <type>boolean</type>
-    indicating if it is a directory.  Typical usages include:
+    file creation time stamp (Windows only), a <type>boolean</type>
+    indicating if it is a directory, and a <type>boolean</type>
+    indicating if it is a symbolic link.  Typical usages include:
 <programlisting>
 SELECT * FROM pg_stat_file('filename');
 SELECT (pg_stat_file('filename')).modification;
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index c4eb10d..df8ae88 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -306,8 +306,8 @@ pg_stat_file(PG_FUNCTION_ARGS)
 	text	   *filename_t = PG_GETARG_TEXT_P(0);
 	char	   *filename;
 	struct stat fst;
-	Datum		values[6];
-	bool		isnull[6];
+	Datum		values[7];
+	bool		isnull[7];
 	HeapTuple	tuple;
 	TupleDesc	tupdesc;
 	bool		missing_ok = false;
@@ -323,7 +323,7 @@ pg_stat_file(PG_FUNCTION_ARGS)
 
 	filename = convert_and_check_filename(filename_t);
 
-	if (stat(filename, &fst) < 0)
+	if (lstat(filename, &fst) < 0)
 	{
 		if (missing_ok && errno == ENOENT)
 			PG_RETURN_NULL();
@@ -337,7 +337,7 @@ pg_stat_file(PG_FUNCTION_ARGS)
 	 * This record type had better match the output parameters declared for me
 	 * in pg_proc.h.
 	 */
-	tupdesc = CreateTemplateTupleDesc(6, false);
+	tupdesc = CreateTemplateTupleDesc(7, false);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 1,
 					   "size", INT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 2,
@@ -350,6 +350,8 @@ pg_stat_file(PG_FUNCTION_ARGS)
 					   "creation", TIMESTAMPTZOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 6,
 					   "isdir", BOOLOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7,
+					   "islink", BOOLOID, -1, 0);
 	BlessTupleDesc(tupdesc);
 
 	memset(isnull, false, sizeof(isnull));
@@ -366,6 +368,11 @@ pg_stat_file(PG_FUNCTION_ARGS)
 	values[4] = TimestampTzGetDatum(time_t_to_timestamptz(fst.st_ctime));
 #endif
 	values[5] = BoolGetDatum(S_ISDIR(fst.st_mode));
+#ifndef WIN32
+	values[6] = BoolGetDatum(S_ISLNK(fst.st_mode));
+#else
+	values[6] = BoolGetDatum(pgwin32_is_junction(filename));
+#endif
 
 	tuple = heap_form_tuple(tupdesc, values, isnull);
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index be3a8fb..f6abd74 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3180,10 +3180,9 @@ DATA(insert OID = 2621 ( pg_reload_conf			PGNSP PGUID 12 1 0 0 0 f f f f t f v 0
 DESCR("reload configuration files");
 DATA(insert OID = 2622 ( pg_rotate_logfile		PGNSP PGUID 12 1 0 0 0 f f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ _null_ pg_rotate_logfile _null_ _null_ _null_ ));
 DESCR("rotate log file");
-
-DATA(insert OID = 2623 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2249 "25" "{25,20,1184,1184,1184,1184,16}" "{i,o,o,o,o,o,o}" "{filename,size,access,modification,change,creation,isdir}" _null_ _null_ pg_stat_file_1arg _null_ _null_ _null_ ));
+DATA(insert OID = 2623 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2249 "25" "{25,20,1184,1184,1184,1184,16,16}" "{i,o,o,o,o,o,o,o}" "{filename,size,access,modification,change,creation,isdir,islink}" _null_ _null_ pg_stat_file_1arg _null_ _null_ _null_ ));
 DESCR("get information about file");
-DATA(insert OID = 3307 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2249 "25 16" "{25,16,20,1184,1184,1184,1184,16}" "{i,i,o,o,o,o,o,o}" "{filename,missing_ok,size,access,modification,change,creation,isdir}" _null_ _null_ pg_stat_file _null_ _null_ _null_ ));
+DATA(insert OID = 3307 ( pg_stat_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2249 "25 16" "{25,16,20,1184,1184,1184,1184,16,16}" "{i,i,o,o,o,o,o,o,o}" "{filename,missing_ok,size,access,modification,change,creation,isdir,islink}" _null_ _null_ pg_stat_file _null_ _null_ _null_ ));
 DESCR("get information about file");
 DATA(insert OID = 2624 ( pg_read_file		PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 25 "25 20 20" _null_ _null_ _null_ _null_ _null_ pg_read_file_off_len _null_ _null_ _null_ ));
 DESCR("read text from a file");
-- 
2.4.5

0003-Add-pg_readlink-to-get-value-of-a-symbolic-link.patchtext/x-patch; charset=US-ASCII; name=0003-Add-pg_readlink-to-get-value-of-a-symbolic-link.patchDownload
From 0ea59fb774e61e60def14f5a3470429695ad6adc Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Mon, 29 Jun 2015 14:40:20 +0900
Subject: [PATCH 3/5] Add pg_readlink to get value of a symbolic link

This is similar to Posix's readlink, except that PostgreSQL restrictions
are applied to the source path queried, similarly to other functions of
genfile.c.
---
 doc/src/sgml/func.sgml          |  9 ++++++
 src/backend/utils/adt/genfile.c | 66 +++++++++++++++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.h   |  4 +++
 src/include/utils/builtins.h    |  2 ++
 4 files changed, 81 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 67e1626..80a4f26 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17845,6 +17845,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
         Return information about a file.
        </entry>
       </row>
+      <row>
+       <entry>
+        <literal><function>pg_readlink(<parameter>filename</> <type>text</>[, <parameter>missing_ok</> <type>boolean</type>])</function></literal>
+       </entry>
+       <entry><type>text</type></entry>
+       <entry>
+        Return value of a symbolic link.
+       </entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index df8ae88..a0f8d7b 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -480,3 +480,69 @@ pg_ls_dir_1arg(PG_FUNCTION_ARGS)
 {
 	return pg_ls_dir(fcinfo);
 }
+
+
+/*
+ * pg_readlink - Find the real location of the provided symbolic link.
+ */
+Datum
+pg_readlink(PG_FUNCTION_ARGS)
+{
+	text	   *sourcepath_t = PG_GETARG_TEXT_P(0);
+	char	   *sourcepath;
+	char        targetpath[MAXPGPATH];
+	int			rllen;
+	bool		missing_ok = false;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to read files"))));
+
+	if (PG_NARGS() == 2)
+		missing_ok = PG_GETARG_BOOL(1);
+
+#if defined(HAVE_READLINK) || defined(WIN32)
+	sourcepath = convert_and_check_filename(sourcepath_t);
+
+	rllen = readlink(sourcepath, targetpath, sizeof(targetpath));
+	if (rllen < 0)
+	{
+		if (missing_ok && errno == ENOENT)
+			PG_RETURN_NULL();
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read symbolic link \"%s\": %m",
+							sourcepath)));
+	}
+	if (rllen >= sizeof(targetpath))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("symbolic link \"%s\" target is too long",
+						sourcepath)));
+	}
+	targetpath[rllen] = '\0';
+
+	PG_RETURN_TEXT_P(cstring_to_text(targetpath));
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("symbolic links are not supported by this platform")));
+	PG_RETURN_NULL();
+#endif
+}
+
+/*
+ * pg_readlink (1 argument version)
+ *
+ * note: this wrapper is necessary to pass the sanity check in opr_sanity,
+ * which checks that all built-in functions that share the implementing C
+ * function take the same number of arguments
+ */
+Datum
+pg_readlink_1arg(PG_FUNCTION_ARGS)
+{
+	return pg_readlink(fcinfo);
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f6abd74..f9c6176 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3200,6 +3200,10 @@ DATA(insert OID = 2625 ( pg_ls_dir			PGNSP PGUID 12 1 1000 0 0 f f f f t t v 1 0
 DESCR("list all files in a directory");
 DATA(insert OID = 3297 ( pg_ls_dir			PGNSP PGUID 12 1 1000 0 0 f f f f t t v 3 0 25 "25 16 16" _null_ _null_ _null_ _null_ _null_ pg_ls_dir _null_ _null_ _null_ ));
 DESCR("list all files in a directory");
+DATA(insert OID = 3298 ( pg_readlink		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 25 "25" _null_ _null_ _null_ _null_ _null_ pg_readlink_1arg _null_ _null_ _null_ ));
+DESCR("read value of a symbolic link");
+DATA(insert OID = 3299 ( pg_readlink		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 25 "25 16" _null_ _null_ _null_ _null_ _null_ pg_readlink _null_ _null_ _null_ ));
+DESCR("read value of a symbolic link");
 DATA(insert OID = 2626 ( pg_sleep			PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "701" _null_ _null_ _null_ _null_ _null_ pg_sleep _null_ _null_ _null_ ));
 DESCR("sleep for the specified time in seconds");
 DATA(insert OID = 3935 ( pg_sleep_for			PGNSP PGUID 14 1 0 0 0 f f f f t f v 1 0 2278 "1186" _null_ _null_ _null_ _null_ _null_ "select pg_catalog.pg_sleep(extract(epoch from pg_catalog.clock_timestamp() operator(pg_catalog.+) $1) operator(pg_catalog.-) extract(epoch from pg_catalog.clock_timestamp()))" _null_ _null_ _null_ ));
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 9855672..615e5ab 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -479,6 +479,8 @@ extern Datum pg_read_binary_file_off_len(PG_FUNCTION_ARGS);
 extern Datum pg_read_binary_file_all(PG_FUNCTION_ARGS);
 extern Datum pg_ls_dir(PG_FUNCTION_ARGS);
 extern Datum pg_ls_dir_1arg(PG_FUNCTION_ARGS);
+extern Datum pg_readlink(PG_FUNCTION_ARGS);
+extern Datum pg_readlink_1arg(PG_FUNCTION_ARGS);
 
 /* misc.c */
 extern Datum current_database(PG_FUNCTION_ARGS);
-- 
2.4.5

0004-Extend-pg_tablespace_location-with-option-missing_ok.patchtext/x-patch; charset=US-ASCII; name=0004-Extend-pg_tablespace_location-with-option-missing_ok.patchDownload
From bdf72b13775b7fa71d24a92c71fc0c4002109e01 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Mon, 29 Jun 2015 15:07:24 +0900
Subject: [PATCH 4/5] Extend pg_tablespace_location with option missing_ok

This is useful for code paths that prefer receive an empty response
instead of an ERROR if tablespace specified does not exist.
---
 doc/src/sgml/func.sgml        |  9 +++++++--
 src/backend/utils/adt/misc.c  | 30 ++++++++++++++++++++++++++----
 src/include/catalog/pg_proc.h |  2 ++
 src/include/utils/builtins.h  |  1 +
 4 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 80a4f26..0257df6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15850,9 +15850,14 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);
        <entry>get the set of database OIDs that have objects in the tablespace</entry>
       </row>
       <row>
-       <entry><literal><function>pg_tablespace_location(<parameter>tablespace_oid</parameter>)</function></literal></entry>
+       <entry><literal><function>pg_tablespace_location(<parameter>tablespace_oid</parameter> [, <parameter>missing_ok</> <type>boolean</type>])</function></literal></entry>
        <entry><type>text</type></entry>
-       <entry>get the path in the file system that this tablespace is located in</entry>
+       <entry>
+        Get the path in the file system that this tablespace is located in.
+        If <parameter>missing_ok</> is <literal>true</>, the function returns
+        NULL when the tablespace does not exist. if <literal>false</>, an
+        error is raised. The default is <literal>false</>.
+       </entry>
       </row>
       <row>
        <entry><literal><function>pg_typeof(<parameter>any</parameter>)</function></literal></entry>
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index c0495d9..f5f61c5 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -347,6 +347,10 @@ pg_tablespace_location(PG_FUNCTION_ARGS)
 	char		sourcepath[MAXPGPATH];
 	char		targetpath[MAXPGPATH];
 	int			rllen;
+	bool		missing_ok = false;
+
+	if (PG_NARGS() == 2)
+		missing_ok = PG_GETARG_BOOL(1);
 
 	/*
 	 * It's useful to apply this function to pg_class.reltablespace, wherein
@@ -373,10 +377,15 @@ pg_tablespace_location(PG_FUNCTION_ARGS)
 
 	rllen = readlink(sourcepath, targetpath, sizeof(targetpath));
 	if (rllen < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read symbolic link \"%s\": %m",
-						sourcepath)));
+	{
+		if (missing_ok && errno == ENOENT)
+			PG_RETURN_NULL();
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read symbolic link \"%s\": %m",
+							sourcepath)));
+	}
 	if (rllen >= sizeof(targetpath))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
@@ -394,6 +403,19 @@ pg_tablespace_location(PG_FUNCTION_ARGS)
 }
 
 /*
+ * pg_tablespace_location (1 argument version)
+ *
+ * note: this wrapper is necessary to pass the sanity check in opr_sanity,
+ * which checks that all built-in functions that share the implementing C
+ * function take the same number of arguments.
+ */
+Datum
+pg_tablespace_location_1arg(PG_FUNCTION_ARGS)
+{
+	return pg_tablespace_location(fcinfo);
+}
+
+/*
  * pg_sleep - delay for N seconds
  */
 Datum
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f9c6176..6ccbf9b 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2922,6 +2922,8 @@ DESCR("current trigger depth");
 
 DATA(insert OID = 3778 ( pg_tablespace_location PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 25 "26" _null_ _null_ _null_ _null_ _null_ pg_tablespace_location _null_ _null_ _null_ ));
 DESCR("tablespace location");
+DATA(insert OID = 3579 ( pg_tablespace_location PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 25 "26 16" _null_ _null_ _null_ _null_ _null_ pg_tablespace_location_1arg _null_ _null_ _null_ ));
+DESCR("tablespace location");
 
 DATA(insert OID = 1946 (  encode						PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 25 "17 25" _null_ _null_ _null_ _null_ _null_ binary_encode _null_ _null_ _null_ ));
 DESCR("convert bytea value into some ascii-only text string");
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 615e5ab..150afb6 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -490,6 +490,7 @@ extern Datum pg_terminate_backend(PG_FUNCTION_ARGS);
 extern Datum pg_reload_conf(PG_FUNCTION_ARGS);
 extern Datum pg_tablespace_databases(PG_FUNCTION_ARGS);
 extern Datum pg_tablespace_location(PG_FUNCTION_ARGS);
+extern Datum pg_tablespace_location_1arg(PG_FUNCTION_ARGS);
 extern Datum pg_rotate_logfile(PG_FUNCTION_ARGS);
 extern Datum pg_sleep(PG_FUNCTION_ARGS);
 extern Datum pg_get_keywords(PG_FUNCTION_ARGS);
-- 
2.4.5

0005-Fix-symlink-usage-in-pg_rewind.patchtext/x-patch; charset=US-ASCII; name=0005-Fix-symlink-usage-in-pg_rewind.patchDownload
From f44c286ec129deaf120986a5025996001cfd8ffe Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Mon, 29 Jun 2015 15:09:34 +0900
Subject: [PATCH 5/5] Fix symlink usage in pg_rewind

Make use of pg_readlink to determine where a symlink in a source PGDATA
streamed is from. Also remove restrictions related to symlink comparisons
on source and target instances, those need to be only limited to
tablespaces.
---
 src/bin/pg_rewind/file_ops.c    |  2 ++
 src/bin/pg_rewind/filemap.c     | 11 +++--------
 src/bin/pg_rewind/filemap.h     |  1 +
 src/bin/pg_rewind/libpq_fetch.c | 32 ++++++++++++++++----------------
 4 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index c2d8aa1..29ddb1d 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -139,6 +139,7 @@ remove_target(file_entry_t *entry)
 			remove_target_file(entry->path);
 			break;
 
+		case FILE_TYPE_TABLESPACE:
 		case FILE_TYPE_SYMLINK:
 			remove_target_symlink(entry->path);
 			break;
@@ -156,6 +157,7 @@ create_target(file_entry_t *entry)
 			create_target_dir(entry->path);
 			break;
 
+		case FILE_TYPE_TABLESPACE:
 		case FILE_TYPE_SYMLINK:
 			create_target_symlink(entry->path, entry->link_target);
 			break;
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 05eff68..6ef2342 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -112,12 +112,6 @@ process_source_file(const char *path, file_type_t type, size_t newsize,
 	switch (type)
 	{
 		case FILE_TYPE_DIRECTORY:
-			if (exists && !S_ISDIR(statbuf.st_mode))
-			{
-				/* it's a directory in source, but not in target. Strange.. */
-				pg_fatal("\"%s\" is not a directory\n", localpath);
-			}
-
 			if (!exists)
 				action = FILE_ACTION_CREATE;
 			else
@@ -125,7 +119,7 @@ process_source_file(const char *path, file_type_t type, size_t newsize,
 			oldsize = 0;
 			break;
 
-		case FILE_TYPE_SYMLINK:
+		case FILE_TYPE_TABLESPACE:
 			if (exists &&
 #ifndef WIN32
 				!S_ISLNK(statbuf.st_mode)
@@ -140,7 +134,8 @@ process_source_file(const char *path, file_type_t type, size_t newsize,
 				 */
 				pg_fatal("\"%s\" is not a symbolic link\n", localpath);
 			}
-
+			/* fallback to default */
+		case FILE_TYPE_SYMLINK:
 			if (!exists)
 				action = FILE_ACTION_CREATE;
 			else
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index 9943ec5..35540d6 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -36,6 +36,7 @@ typedef enum
 {
 	FILE_TYPE_REGULAR,
 	FILE_TYPE_DIRECTORY,
+	FILE_TYPE_TABLESPACE,
 	FILE_TYPE_SYMLINK
 } file_type_t;
 
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 05aa133..d77c931 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -140,30 +140,26 @@ libpqProcessFileList(void)
 	 * Create a recursive directory listing of the whole data directory.
 	 *
 	 * The WITH RECURSIVE part does most of the work. The second part gets the
-	 * targets of the symlinks in pg_tblspc directory.
-	 *
-	 * XXX: There is no backend function to get a symbolic link's target in
-	 * general, so if the admin has put any custom symbolic links in the data
-	 * directory, they won't be copied correctly.
+	 * targets of the symlinks in the data directory.
 	 */
 	sql =
-		"WITH RECURSIVE files (path, filename, size, isdir) AS (\n"
-		"  SELECT '' AS path, filename, size, isdir FROM\n"
+		"WITH RECURSIVE files (path, filename, size, isdir, islink) AS (\n"
+		"  SELECT '' AS path, filename, size, isdir, islink FROM\n"
 		"  (SELECT pg_ls_dir('.', true, false) AS filename) AS fn,\n"
 		"        pg_stat_file(fn.filename, true) AS this\n"
 		"  UNION ALL\n"
 		"  SELECT parent.path || parent.filename || '/' AS path,\n"
-		"         fn, this.size, this.isdir\n"
+		"         fn, this.size, this.isdir, this.islink\n"
 		"  FROM files AS parent,\n"
 		"       pg_ls_dir(parent.path || parent.filename, true, false) AS fn,\n"
 		"       pg_stat_file(parent.path || parent.filename || '/' || fn, true) AS this\n"
 		"       WHERE parent.isdir = 't'\n"
 		")\n"
-		"SELECT path || filename, size, isdir,\n"
-		"       pg_tablespace_location(pg_tablespace.oid) AS link_target\n"
-		"FROM files\n"
-		"LEFT OUTER JOIN pg_tablespace ON files.path = 'pg_tblspc/'\n"
-		"                             AND oid::text = files.filename\n";
+		"SELECT path || filename, size, isdir, islink,\n"
+		"       path = 'pg_tblspc/' AS istblspc,\n"
+		"       CASE WHEN islink THEN pg_readlink(path || filename)\n"
+		"            ELSE NULL END AS link_target\n"
+		"FROM files\n";
 	res = PQexec(conn, sql);
 
 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
@@ -171,7 +167,7 @@ libpqProcessFileList(void)
 				 PQresultErrorMessage(res));
 
 	/* sanity check the result set */
-	if (PQnfields(res) != 4)
+	if (PQnfields(res) != 6)
 		pg_fatal("unexpected result set while fetching file list\n");
 
 	/* Read result to local variables */
@@ -180,7 +176,9 @@ libpqProcessFileList(void)
 		char	   *path = PQgetvalue(res, i, 0);
 		int			filesize = atoi(PQgetvalue(res, i, 1));
 		bool		isdir = (strcmp(PQgetvalue(res, i, 2), "t") == 0);
-		char	   *link_target = PQgetvalue(res, i, 3);
+		bool		islink = (strcmp(PQgetvalue(res, i, 3), "t") == 0);
+		bool		istblspc = (strcmp(PQgetvalue(res, i, 4), "t") == 0);
+		char	   *link_target = PQgetvalue(res, i, 5);
 		file_type_t type;
 
 		if (PQgetisnull(res, 0, 1))
@@ -192,7 +190,9 @@ libpqProcessFileList(void)
 			continue;
 		}
 
-		if (link_target[0])
+		if (istblspc)
+			type = FILE_TYPE_TABLESPACE;
+		else if (islink)
 			type = FILE_TYPE_SYMLINK;
 		else if (isdir)
 			type = FILE_TYPE_DIRECTORY;
-- 
2.4.5

#28Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#26)
1 attachment(s)
Re: pg_rewind failure by file deletion in source server

On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote:

After thinking about this some more, I think it'd be acceptable if we just
fail, if there are any non-writeable files in the data directory. The
typical scenario is that postgresql.conf, or an SSL cert file, is a symlink
to outside the data directory. It seems reasonable to require that the DBA
just removes the symlink before running pg_rewind, and restores it
afterwards if appropriate. In many cases, you would *not* want to overwrite
your config files, SSL cert files, etc., so the DBA will need to script
backing up and restoring those anyway.

Outnumbered I surrender. We are on the way of a doc patch then as
pg_rewind already fails in this scenario when trying to open the
target. What about the attached?

(It's a fair question whether pg_rewind should be copying those files in the
first place. I've opted for "yes", so that it's easy to explain the
behaviour of pg_rewind: the end result is the same as if you took a new base
backup from the source server. Whatever files you want to backup up before
you re-initialize from a base backup, you should also backup with
pg_rewind.)

We could then ignore all the symlinks in PGDATA except the ones that
are related to tablespaces, that's where the use of missing_ok for
pg_tablespace_location makes sense. pg_readlink would be useful for
pg_xlog in any case either way.

But we'll still need to handle the pg_xlog symlink case somehow. Perhaps it
would be enough to special-case pg_xlog for now.

Well, sure, pg_rewind does not copy the soft links either way. Now it
would be nice to have an option to be able to recreate the soft link
of at least pg_xlog even if it can be scripted as well after a run.
Regards,
--
Michael

Attachments:

20150630_rewind_readonly_doc.patchtext/x-patch; charset=US-ASCII; name=20150630_rewind_readonly_doc.patchDownload
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 32dc83f..92325f3 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -234,4 +234,15 @@ PostgreSQL documentation
   </refsect2>
  </refsect1>
 
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+   Read-only files cannot be included in the data directory of target
+   instance before starting <application>pg_rewind</>. For example, soft
+   links to read-only files should be temporarily removed from the data
+   directory and added back after the process.
+  </para>
+ </refsect1>
+
 </refentry>
#29Vladimir Borodin
root@simply.name
In reply to: Heikki Linnakangas (#25)
Re: pg_rewind failure by file deletion in source server

28 июня 2015 г., в 21:46, Heikki Linnakangas <hlinnaka@iki.fi> написал(а):

On 06/24/2015 09:43 AM, Michael Paquier wrote:

Attached is a new set of patches. Except for the last ones that
addresses one issue of pg_rewind (symlink management when streaming
PGDATA), all the others introduce if_not_exists options for the
functions of genfile.c. The pg_rewind stuff could be more polished
though. Feel free to comment.

I've committed the additional option to the functions in genfile.c (I renamed it to "missing_ok", for clarity), and the pg_rewind changes to use that option.

And since it changes API it would not be back-ported to 9.4, right?

I ended up refactoring the patch quite a bit, so if you could double-check what I committed to make sure I didn't break anything, that would be great.

I didn't commit the tablespace or symlink handling changes yet, will review those separately.

I also didn't commit the new regression test yet. It would indeed be nice to have one, but I think it was a few bricks shy of a load. It should work in a freshly initdb'd system, but not necessarily on an existing installation. First, it relied on the fact that postgresql.conf.auto exists, but a DBA might remove that if he wants to make sure the feature is not used. Secondly, it relied on the fact that pg_twophase is empty, but there is no guarantee of that either. Third, the error messages included in the expected output, e.g "No such file or directory", depend on the operating system and locale. And finally, it'd be nice to test more things, in particular the behaviour of different offsets and lengths to pg_read_binary_file(), although an incomplete test would be better than no test at all.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
May the force be with you…
https://simply.name

#30Michael Paquier
michael.paquier@gmail.com
In reply to: Vladimir Borodin (#29)
Re: pg_rewind failure by file deletion in source server

On Mon, Jun 29, 2015 at 3:53 PM, Vladimir Borodin <root@simply.name> wrote:

28 июня 2015 г., в 21:46, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I've committed the additional option to the functions in genfile.c (I
renamed it to "missing_ok", for clarity), and the pg_rewind changes to use
that option.

And since it changes API it would not be back-ported to 9.4, right?

Those API changes are not going to be back-patched to 9.4 as this is
basically a new feature. For 9.4's pg_rewind, the problem does not
concern this mailing list anyway, so let's discuss it directly on the
project page on github. Just mentioning, but I am afraid that we are
going to need a set of TRY/CATCH blocks with a wrapper function in
plpgsql that does the failure legwork done here, or to export those
functions in an extension with some copy/paste as all the necessary
routines of genfile.c are not exposed externally, and to be sure that
those functions are installed on the source server. I am not sure that
I will be able to look at that in the short term unfortunately...
Hence patches are welcome there and I will happily review, argue or
integrate them if needed. The non-streaming option is not impacted in
any case, so it's not like pg_rewind cannot be used at all on 9.4 or
9.3 instances.
Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#28)
1 attachment(s)
Re: pg_rewind failure by file deletion in source server

On 06/29/2015 09:44 AM, Michael Paquier wrote:

On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote:

But we'll still need to handle the pg_xlog symlink case somehow. Perhaps it
would be enough to special-case pg_xlog for now.

Well, sure, pg_rewind does not copy the soft links either way. Now it
would be nice to have an option to be able to recreate the soft link
of at least pg_xlog even if it can be scripted as well after a run.

Hmm. I'm starting to think that pg_rewind should ignore pg_xlog
entirely. In any non-trivial scenarios, just copying all the files from
pg_xlog isn't enough anyway, and you need to set up a recovery.conf
after running pg_rewind that contains a restore_command or
primary_conninfo, to fetch the WAL. So you can argue that by not copying
pg_xlog automatically, we're actually doing a favour to the DBA, by
forcing him to set up the recovery.conf file correctly. Because if you
just test simple scenarios where not much time has passed between the
failover and running pg_rewind, it might be enough to just copy all the
WAL currently in pg_xlog, but it would not be enough if more time had
passed and not all the required WAL is present in pg_xlog anymore. And
by not copying the WAL, we can avoid some copying, as restore_command or
streaming replication will only copy what's needed, while pg_rewind
would copy all WAL it can find the target's data directory.

pg_basebackup also doesn't include any WAL, unless you pass the --xlog
option. It would be nice to also add an optional --xlog option to
pg_rewind, but with pg_rewind it's possible that all the required WAL
isn't present in the pg_xlog directory anymore, so you wouldn't always
achieve the same effect of making the backup self-contained.

So, I propose the attached. It makes pg_rewind ignore the pg_xlog
directory in both the source and the target.

- Heikki

Attachments:

ignore-pg_xlog-in-pg_rewind-1.patchtext/x-diff; name=ignore-pg_xlog-in-pg_rewind-1.patchDownload
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 32dc83f..bb1d640 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -49,12 +49,13 @@ PostgreSQL documentation
 
   <para>
    The result is equivalent to replacing the target data directory with the
-   source one. All files are copied, including configuration files. The
-   advantage of <application>pg_rewind</> over taking a new base backup, or
-   tools like <application>rsync</>, is that <application>pg_rewind</> does
-   not require reading through all unchanged files in the cluster. That makes
-   it a lot faster when the database is large and only a small portion of it
-   differs between the clusters.
+   source one. All files in the data directory are copied, including
+   configuration files, but excluding the WAL log directory,
+   <filename>pg_xlog</>. The advantage of <application>pg_rewind</> over
+   taking a new base backup, or tools like <application>rsync</>, is that
+   <application>pg_rewind</> does not require reading through all unchanged
+   files in the cluster. That makes it a lot faster when the database is
+   large and only a small portion of it differs between the clusters.
   </para>
 
   <para>
@@ -74,12 +75,12 @@ PostgreSQL documentation
    When the target server is started up for the first time after running
    <application>pg_rewind</>, it will go into recovery mode and replay all
    WAL generated in the source server after the point of divergence.
-   If some of the WAL was no longer available in the source server when
-   <application>pg_rewind</> was run, and therefore could not be copied by
-   <application>pg_rewind</> session, it needs to be made available when the
-   target server is started up. That can be done by creating a
-   <filename>recovery.conf</> file in the target data directory with a
-   suitable <varname>restore_command</>.
+   Like after restoring from a base backup using continous archiving,
+   the source server's WAL must be made available to the rewound server.
+   This can be done by creating a <filename>recovery.conf</> file in the
+   target data directory with a suitable <varname>restore_command</> or
+   <varname>primary_conninfo</> line, or by copying the required WAL files
+   manually into <filename>pg_xlog</> in the target data directory.
   </para>
  </refsect1>
 
diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index 5219ec9..82e3f4c 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -284,14 +284,12 @@ sub run_pg_rewind
 
 	# Keep a temporary postgresql.conf for master node or it would be
 	# overwritten during the rewind.
-	copy(
-		"$test_master_datadir/postgresql.conf",
-		"$testroot/master-postgresql.conf.tmp");
+	copy("$test_master_datadir/postgresql.conf",
+		 "$testroot/master-postgresql.conf.tmp");
 
 	# Now run pg_rewind
 	if ($test_mode eq "local")
 	{
-
 		# Do rewind using a local pgdata as source
 		# Stop the master and be ready to perform the rewind
 		system_or_bail(
@@ -306,10 +304,22 @@ sub run_pg_rewind
 			$log_path,
 			'2>&1');
 		ok($result, 'pg_rewind local');
+
+		# Copy WAL files from the source server so that recovery will find
+		# them. (In remote mode, the source server is still running, and the
+		# rewound server will use streaming replication to connect to the
+		# source server and fetch the WAL from there.)
+		opendir(DIR,"$test_standby_datadir/pg_xlog") or die "Cannot open $test_standby_datadir/pg_xlo\n";
+		my @xlogfiles = readdir(DIR);
+		closedir(DIR);
+
+		foreach my $xlogfile (@xlogfiles) {
+			copy("$test_standby_datadir/pg_xlog/$xlogfile",
+				 "$test_master_datadir/pg_xlog/");
+		}
 	}
 	elsif ($test_mode eq "remote")
 	{
-
 		# Do rewind using a remote connection as source
 		my $result = run(
 			[   'pg_rewind',
diff --git a/src/bin/pg_rewind/copy_fetch.c b/src/bin/pg_rewind/copy_fetch.c
index 224fad1..f437391 100644
--- a/src/bin/pg_rewind/copy_fetch.c
+++ b/src/bin/pg_rewind/copy_fetch.c
@@ -74,6 +74,9 @@ recurse_dir(const char *datadir, const char *parentpath,
 			strcmp(xlde->d_name, "..") == 0)
 			continue;
 
+		if (parentpath == NULL && strcmp(xlde->d_name, "pg_xlog") == 0)
+			continue;
+
 		snprintf(fullpath, MAXPGPATH, "%s/%s", fullparentpath, xlde->d_name);
 
 		if (lstat(fullpath, &fst) < 0)
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 05aa133..2f411e3 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -151,6 +151,7 @@ libpqProcessFileList(void)
 		"  SELECT '' AS path, filename, size, isdir FROM\n"
 		"  (SELECT pg_ls_dir('.', true, false) AS filename) AS fn,\n"
 		"        pg_stat_file(fn.filename, true) AS this\n"
+		"   WHERE filename <> 'pg_xlog'"
 		"  UNION ALL\n"
 		"  SELECT parent.path || parent.filename || '/' AS path,\n"
 		"         fn, this.size, this.isdir\n"
#32Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#31)
Re: pg_rewind failure by file deletion in source server

On Wed, Jul 1, 2015 at 2:21 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In
any non-trivial scenarios, just copying all the files from pg_xlog isn't
enough anyway, and you need to set up a recovery.conf after running
pg_rewind that contains a restore_command or primary_conninfo, to fetch the
WAL. So you can argue that by not copying pg_xlog automatically, we're
actually doing a favour to the DBA, by forcing him to set up the
recovery.conf file correctly. Because if you just test simple scenarios
where not much time has passed between the failover and running pg_rewind,
it might be enough to just copy all the WAL currently in pg_xlog, but it
would not be enough if more time had passed and not all the required WAL is
present in pg_xlog anymore. And by not copying the WAL, we can avoid some
copying, as restore_command or streaming replication will only copy what's
needed, while pg_rewind would copy all WAL it can find the target's data
directory.
pg_basebackup also doesn't include any WAL, unless you pass the --xlog
option. It would be nice to also add an optional --xlog option to pg_rewind,
but with pg_rewind it's possible that all the required WAL isn't present in
the pg_xlog directory anymore, so you wouldn't always achieve the same
effect of making the backup self-contained.

Those are very convincing arguments.

So, I propose the attached. It makes pg_rewind ignore the pg_xlog directory
in both the source and the target.

Minor thing: s/continous/continuous. Except that this patch looks sane to me.

(btw, it seems to me that we still have a race condition with
pg_tablespace_location).
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#33Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#31)
Re: pg_rewind failure by file deletion in source server

On Wed, Jul 1, 2015 at 2:21 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/29/2015 09:44 AM, Michael Paquier wrote:

On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote:

But we'll still need to handle the pg_xlog symlink case somehow. Perhaps
it
would be enough to special-case pg_xlog for now.

Well, sure, pg_rewind does not copy the soft links either way. Now it
would be nice to have an option to be able to recreate the soft link
of at least pg_xlog even if it can be scripted as well after a run.

Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In
any non-trivial scenarios, just copying all the files from pg_xlog isn't
enough anyway, and you need to set up a recovery.conf after running
pg_rewind that contains a restore_command or primary_conninfo, to fetch the
WAL. So you can argue that by not copying pg_xlog automatically, we're
actually doing a favour to the DBA, by forcing him to set up the
recovery.conf file correctly. Because if you just test simple scenarios
where not much time has passed between the failover and running pg_rewind,
it might be enough to just copy all the WAL currently in pg_xlog, but it
would not be enough if more time had passed and not all the required WAL is
present in pg_xlog anymore. And by not copying the WAL, we can avoid some
copying, as restore_command or streaming replication will only copy what's
needed, while pg_rewind would copy all WAL it can find the target's data
directory.

pg_basebackup also doesn't include any WAL, unless you pass the --xlog
option. It would be nice to also add an optional --xlog option to pg_rewind,
but with pg_rewind it's possible that all the required WAL isn't present in
the pg_xlog directory anymore, so you wouldn't always achieve the same
effect of making the backup self-contained.

So, I propose the attached. It makes pg_rewind ignore the pg_xlog directory
in both the source and the target.

If pg_xlog is simply ignored, some old WAL files may remain in target server.
Don't these old files cause the subsequent startup of target server as new
standby to fail? That is, it's the case where the WAL file with the same name
but different content exist both in target and source. If that's harmfull,
pg_rewind also should remove the files in pg_xlog of target server.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#34Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#33)
1 attachment(s)
Re: pg_rewind failure by file deletion in source server

On Wed, Jul 1, 2015 at 9:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Jul 1, 2015 at 2:21 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/29/2015 09:44 AM, Michael Paquier wrote:

On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote:

But we'll still need to handle the pg_xlog symlink case somehow. Perhaps
it
would be enough to special-case pg_xlog for now.

Well, sure, pg_rewind does not copy the soft links either way. Now it
would be nice to have an option to be able to recreate the soft link
of at least pg_xlog even if it can be scripted as well after a run.

Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In
any non-trivial scenarios, just copying all the files from pg_xlog isn't
enough anyway, and you need to set up a recovery.conf after running
pg_rewind that contains a restore_command or primary_conninfo, to fetch the
WAL. So you can argue that by not copying pg_xlog automatically, we're
actually doing a favour to the DBA, by forcing him to set up the
recovery.conf file correctly. Because if you just test simple scenarios
where not much time has passed between the failover and running pg_rewind,
it might be enough to just copy all the WAL currently in pg_xlog, but it
would not be enough if more time had passed and not all the required WAL is
present in pg_xlog anymore. And by not copying the WAL, we can avoid some
copying, as restore_command or streaming replication will only copy what's
needed, while pg_rewind would copy all WAL it can find the target's data
directory.

pg_basebackup also doesn't include any WAL, unless you pass the --xlog
option. It would be nice to also add an optional --xlog option to pg_rewind,
but with pg_rewind it's possible that all the required WAL isn't present in
the pg_xlog directory anymore, so you wouldn't always achieve the same
effect of making the backup self-contained.

So, I propose the attached. It makes pg_rewind ignore the pg_xlog directory
in both the source and the target.

If pg_xlog is simply ignored, some old WAL files may remain in target server.
Don't these old files cause the subsequent startup of target server as new
standby to fail? That is, it's the case where the WAL file with the same name
but different content exist both in target and source. If that's harmfull,
pg_rewind also should remove the files in pg_xlog of target server.

This would reduce usability. The rewound node will replay WAL from the
previous checkpoint where WAL forked up to the minimum recovery point
of source node where pg_rewind has been run. Hence if we remove
completely the contents of pg_xlog we'd lose a portion of the logs
that need to be replayed until timeline is switched on the rewound
node when recovering it (while streaming from the promoted standby,
whatever). I don't really see why recycled segments would be a
problem, as that's perhaps what you are referring to, but perhaps I am
missing something.

Attached is a rebased version of the previous patch to ignore the
contents of pg_xlog/ when rewinding.
--
Michael

Attachments:

20150717_pgrewind_ignore_xlog.patchbinary/octet-stream; name=20150717_pgrewind_ignore_xlog.patchDownload
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 32dc83f..bb1d640 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -49,12 +49,13 @@ PostgreSQL documentation
 
   <para>
    The result is equivalent to replacing the target data directory with the
-   source one. All files are copied, including configuration files. The
-   advantage of <application>pg_rewind</> over taking a new base backup, or
-   tools like <application>rsync</>, is that <application>pg_rewind</> does
-   not require reading through all unchanged files in the cluster. That makes
-   it a lot faster when the database is large and only a small portion of it
-   differs between the clusters.
+   source one. All files in the data directory are copied, including
+   configuration files, but excluding the WAL log directory,
+   <filename>pg_xlog</>. The advantage of <application>pg_rewind</> over
+   taking a new base backup, or tools like <application>rsync</>, is that
+   <application>pg_rewind</> does not require reading through all unchanged
+   files in the cluster. That makes it a lot faster when the database is
+   large and only a small portion of it differs between the clusters.
   </para>
 
   <para>
@@ -74,12 +75,12 @@ PostgreSQL documentation
    When the target server is started up for the first time after running
    <application>pg_rewind</>, it will go into recovery mode and replay all
    WAL generated in the source server after the point of divergence.
-   If some of the WAL was no longer available in the source server when
-   <application>pg_rewind</> was run, and therefore could not be copied by
-   <application>pg_rewind</> session, it needs to be made available when the
-   target server is started up. That can be done by creating a
-   <filename>recovery.conf</> file in the target data directory with a
-   suitable <varname>restore_command</>.
+   Like after restoring from a base backup using continous archiving,
+   the source server's WAL must be made available to the rewound server.
+   This can be done by creating a <filename>recovery.conf</> file in the
+   target data directory with a suitable <varname>restore_command</> or
+   <varname>primary_conninfo</> line, or by copying the required WAL files
+   manually into <filename>pg_xlog</> in the target data directory.
   </para>
  </refsect1>
 
diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index db5e90b..4d0fe6d 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -290,6 +290,21 @@ sub run_pg_rewind
 					"--source-pgdata=$test_standby_datadir",
 					"--target-pgdata=$test_master_datadir"],
 				   'pg_rewind local');
+
+		# Copy WAL files from the source server so that recovery will find
+		# them. (In remote mode, the source server is still running, and the
+		# rewound server will use streaming replication to connect to the
+		# source server and fetch the WAL from there.)
+		opendir(DIR,"$test_standby_datadir/pg_xlog") or
+			die "Cannot open $test_standby_datadir/pg_xlog\n";
+		my @xlogfiles = readdir(DIR);
+		closedir(DIR);
+
+		foreach my $xlogfile (@xlogfiles)
+		{
+			copy("$test_standby_datadir/pg_xlog/$xlogfile",
+				 "$test_master_datadir/pg_xlog/");
+		}
 	}
 	elsif ($test_mode eq "remote")
 	{
diff --git a/src/bin/pg_rewind/copy_fetch.c b/src/bin/pg_rewind/copy_fetch.c
index 224fad1..f437391 100644
--- a/src/bin/pg_rewind/copy_fetch.c
+++ b/src/bin/pg_rewind/copy_fetch.c
@@ -74,6 +74,9 @@ recurse_dir(const char *datadir, const char *parentpath,
 			strcmp(xlde->d_name, "..") == 0)
 			continue;
 
+		if (parentpath == NULL && strcmp(xlde->d_name, "pg_xlog") == 0)
+			continue;
+
 		snprintf(fullpath, MAXPGPATH, "%s/%s", fullparentpath, xlde->d_name);
 
 		if (lstat(fullpath, &fst) < 0)
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 1979fbc..bf99e9a 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -154,6 +154,7 @@ libpqProcessFileList(void)
 		"  SELECT '' AS path, filename, size, isdir FROM\n"
 		"  (SELECT pg_ls_dir('.', true, false) AS filename) AS fn,\n"
 		"        pg_stat_file(fn.filename, true) AS this\n"
+		"   WHERE filename <> 'pg_xlog'"
 		"  UNION ALL\n"
 		"  SELECT parent.path || parent.filename || '/' AS path,\n"
 		"         fn, this.size, this.isdir\n"
#35Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#34)
Re: pg_rewind failure by file deletion in source server

On 07/17/2015 06:28 AM, Michael Paquier wrote:

On Wed, Jul 1, 2015 at 9:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Jul 1, 2015 at 2:21 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/29/2015 09:44 AM, Michael Paquier wrote:

On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote:

But we'll still need to handle the pg_xlog symlink case somehow. Perhaps
it
would be enough to special-case pg_xlog for now.

Well, sure, pg_rewind does not copy the soft links either way. Now it
would be nice to have an option to be able to recreate the soft link
of at least pg_xlog even if it can be scripted as well after a run.

Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In
any non-trivial scenarios, just copying all the files from pg_xlog isn't
enough anyway, and you need to set up a recovery.conf after running
pg_rewind that contains a restore_command or primary_conninfo, to fetch the
WAL. So you can argue that by not copying pg_xlog automatically, we're
actually doing a favour to the DBA, by forcing him to set up the
recovery.conf file correctly. Because if you just test simple scenarios
where not much time has passed between the failover and running pg_rewind,
it might be enough to just copy all the WAL currently in pg_xlog, but it
would not be enough if more time had passed and not all the required WAL is
present in pg_xlog anymore. And by not copying the WAL, we can avoid some
copying, as restore_command or streaming replication will only copy what's
needed, while pg_rewind would copy all WAL it can find the target's data
directory.

pg_basebackup also doesn't include any WAL, unless you pass the --xlog
option. It would be nice to also add an optional --xlog option to pg_rewind,
but with pg_rewind it's possible that all the required WAL isn't present in
the pg_xlog directory anymore, so you wouldn't always achieve the same
effect of making the backup self-contained.

So, I propose the attached. It makes pg_rewind ignore the pg_xlog directory
in both the source and the target.

If pg_xlog is simply ignored, some old WAL files may remain in target server.
Don't these old files cause the subsequent startup of target server as new
standby to fail? That is, it's the case where the WAL file with the same name
but different content exist both in target and source. If that's harmfull,
pg_rewind also should remove the files in pg_xlog of target server.

This would reduce usability. The rewound node will replay WAL from the
previous checkpoint where WAL forked up to the minimum recovery point
of source node where pg_rewind has been run. Hence if we remove
completely the contents of pg_xlog we'd lose a portion of the logs
that need to be replayed until timeline is switched on the rewound
node when recovering it (while streaming from the promoted standby,
whatever). I don't really see why recycled segments would be a
problem, as that's perhaps what you are referring to, but perhaps I am
missing something.

Hmm. My thinking was that you need to set up restore_command or
primary_conninfo anyway, to fetch the old WAL, so there's no need to
copy any WAL. But there's a problem with that: you might have WAL files
in the source server that haven't been archived yet, and you need them
to recover the rewound target node. That's OK for libpq mode, I think as
the server is still running and presumably and you can fetch the WAL
with streaming replication, but for copy-mode, that's not a good
assumption. You might be relying on a WAL archive, and the file might
not be archived yet.

Perhaps it's best if we copy all the WAL files from source in copy-mode,
but not in libpq mode. Regarding old WAL files in the target, it's
probably best to always leave them alone. They should do no harm, and as
a general principle it's best to avoid destroying evidence.

It'd be nice to get some fix for this for alpha2, so I'll commit a fix
to do that on Monday, unless we come to a different conclusion before that.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#35)
Re: pg_rewind failure by file deletion in source server

On Sun, Aug 2, 2015 at 4:01 AM, Heikki Linnakangas wrote:

Perhaps it's best if we copy all the WAL files from source in copy-mode, but
not in libpq mode. Regarding old WAL files in the target, it's probably best
to always leave them alone. They should do no harm, and as a general
principle it's best to avoid destroying evidence.

It'd be nice to get some fix for this for alpha2, so I'll commit a fix to do
that on Monday, unless we come to a different conclusion before that.

+1. Both things sound like a good plan to me.
-- 
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#37Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#36)
Re: pg_rewind failure by file deletion in source server

On 08/03/2015 07:01 AM, Michael Paquier wrote:

On Sun, Aug 2, 2015 at 4:01 AM, Heikki Linnakangas wrote:

Perhaps it's best if we copy all the WAL files from source in copy-mode, but
not in libpq mode. Regarding old WAL files in the target, it's probably best
to always leave them alone. They should do no harm, and as a general
principle it's best to avoid destroying evidence.

It'd be nice to get some fix for this for alpha2, so I'll commit a fix to do
that on Monday, unless we come to a different conclusion before that.

+1. Both things sound like a good plan to me.

I had some trouble implementing that. Recovery seemed to get confused
sometimes, when it didn't find some of the WAL files in pg_xlog
directory, even though it could fetch them through streaming
replication. I'll have to investigate that further, but in the meantime,
to have some fix in place for alpha2, I committed an even simpler fix
for the immediate issue that pg_xlog is a symlink: just pretend that
"pg_xlog" is a normal directory, even when it's a symlink.

I'll continue to investigate what was wrong with my initial attempt. And
it would be nice to avoid copying the pre-allocated WAL files from the
source, because it's really unnecessary. But this fixes the immediate
problem that pg_rewind didn't work at all if pg_xlog was a symlink.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#38Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#34)
Re: pg_rewind failure by file deletion in source server

On Fri, Jul 17, 2015 at 12:28 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Jul 1, 2015 at 9:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Jul 1, 2015 at 2:21 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/29/2015 09:44 AM, Michael Paquier wrote:

On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote:

But we'll still need to handle the pg_xlog symlink case somehow. Perhaps
it
would be enough to special-case pg_xlog for now.

Well, sure, pg_rewind does not copy the soft links either way. Now it
would be nice to have an option to be able to recreate the soft link
of at least pg_xlog even if it can be scripted as well after a run.

Hmm. I'm starting to think that pg_rewind should ignore pg_xlog entirely. In
any non-trivial scenarios, just copying all the files from pg_xlog isn't
enough anyway, and you need to set up a recovery.conf after running
pg_rewind that contains a restore_command or primary_conninfo, to fetch the
WAL. So you can argue that by not copying pg_xlog automatically, we're
actually doing a favour to the DBA, by forcing him to set up the
recovery.conf file correctly. Because if you just test simple scenarios
where not much time has passed between the failover and running pg_rewind,
it might be enough to just copy all the WAL currently in pg_xlog, but it
would not be enough if more time had passed and not all the required WAL is
present in pg_xlog anymore. And by not copying the WAL, we can avoid some
copying, as restore_command or streaming replication will only copy what's
needed, while pg_rewind would copy all WAL it can find the target's data
directory.

pg_basebackup also doesn't include any WAL, unless you pass the --xlog
option. It would be nice to also add an optional --xlog option to pg_rewind,
but with pg_rewind it's possible that all the required WAL isn't present in
the pg_xlog directory anymore, so you wouldn't always achieve the same
effect of making the backup self-contained.

So, I propose the attached. It makes pg_rewind ignore the pg_xlog directory
in both the source and the target.

If pg_xlog is simply ignored, some old WAL files may remain in target server.
Don't these old files cause the subsequent startup of target server as new
standby to fail? That is, it's the case where the WAL file with the same name
but different content exist both in target and source. If that's harmfull,
pg_rewind also should remove the files in pg_xlog of target server.

This would reduce usability. The rewound node will replay WAL from the
previous checkpoint where WAL forked up to the minimum recovery point
of source node where pg_rewind has been run. Hence if we remove
completely the contents of pg_xlog we'd lose a portion of the logs
that need to be replayed until timeline is switched on the rewound
node when recovering it (while streaming from the promoted standby,
whatever).

Even if we remove the WAL files in *target server", we don't lose any
files in *source server" that we will need to replay later.

I don't really see why recycled segments would be a
problem, as that's perhaps what you are referring to, but perhaps I am
missing something.

Please imagine the case where the WAL files with the same name were
created in both servers after the fork. Their contents may be different.
After pg_rewind is executed successfully, the rewound server
(i.e., target server) should retrieve and replay that WAL file from
the *source* server. But the problem is that the rewound server tries to
replay the WAL file from its local since the file exists locally (even
if primary_conninfo is specified).

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers