Small race in pg_xlogdump --follow
When using pg_xlogdump in --follow mode, there is what I believe is a small
race condition when files are switched.
If pg_xlogdump detects then end of one file (by looking at the record) it
will immediately try to open the following file. If that file has not yet
been created, it will fail with an error message saying it cannot open the
file. But there's a race condition where the server has not had time to put
the file in place.
Attached patch puts a retry loop around opening the file that retries for 5
seconds (which is excessive, but should be safe) in case the file is
missing (and still fails out for other error messages of course).
Comments?
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Attachments:
pg_xlogdump_race.patchtext/x-patch; charset=US-ASCII; name=pg_xlogdump_race.patchDownload
diff --git a/src/bin/pg_xlogdump/pg_xlogdump.c b/src/bin/pg_xlogdump/pg_xlogdump.c
index 02575eb..a0e0a0c 100644
--- a/src/bin/pg_xlogdump/pg_xlogdump.c
+++ b/src/bin/pg_xlogdump/pg_xlogdump.c
@@ -249,6 +249,7 @@ XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
{
char fname[MAXFNAMELEN];
+ int tries;
/* Switch to another logfile segment */
if (sendFile >= 0)
@@ -258,7 +259,20 @@ XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
XLogFileName(fname, timeline_id, sendSegNo);
- sendFile = fuzzy_open_file(directory, fname);
+ for (tries = 0; tries < 10; tries++)
+ {
+ sendFile = fuzzy_open_file(directory, fname);
+ if (sendFile >= 0)
+ break;
+ if (errno == ENOENT)
+ {
+ /* File not there yet, try again */
+ pg_usleep(500*1000);
+ continue;
+ }
+ /* Any other error, fall through and fail */
+ break;
+ }
if (sendFile < 0)
fatal_error("could not find file \"%s\": %s",
Magnus Hagander <magnus@hagander.net> writes:
Attached patch puts a retry loop around opening the file that retries for 5
seconds (which is excessive, but should be safe) in case the file is
missing (and still fails out for other error messages of course).
Comments?
The patch assumes that pg_usleep won't change errno, an assumption
I have little faith in.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Sep 26, 2016 at 3:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
Attached patch puts a retry loop around opening the file that retries
for 5
seconds (which is excessive, but should be safe) in case the file is
missing (and still fails out for other error messages of course).Comments?
The patch assumes that pg_usleep won't change errno, an assumption
I have little faith in.
Oh, right, at the very last loop. I've never seen it need more than 1 loop
so I didn't manage to hit that codepath :) But yeah, saving errno and
restoring it on the other side of pg_usleep is probably a good idea.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Attachments:
pg_xlogdump_race2.patchtext/x-patch; charset=US-ASCII; name=pg_xlogdump_race2.patchDownload
diff --git a/src/bin/pg_xlogdump/pg_xlogdump.c b/src/bin/pg_xlogdump/pg_xlogdump.c
index 02575eb..6947c0c 100644
--- a/src/bin/pg_xlogdump/pg_xlogdump.c
+++ b/src/bin/pg_xlogdump/pg_xlogdump.c
@@ -249,6 +249,7 @@ XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
{
char fname[MAXFNAMELEN];
+ int tries;
/* Switch to another logfile segment */
if (sendFile >= 0)
@@ -258,7 +259,24 @@ XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
XLogFileName(fname, timeline_id, sendSegNo);
- sendFile = fuzzy_open_file(directory, fname);
+ for (tries = 0; tries < 10; tries++)
+ {
+ sendFile = fuzzy_open_file(directory, fname);
+ if (sendFile >= 0)
+ break;
+ if (errno == ENOENT)
+ {
+ int save_errno = errno;
+
+ /* File not there yet, try again */
+ pg_usleep(500 * 1000);
+
+ errno = save_errno;
+ continue;
+ }
+ /* Any other error, fall through and fail */
+ break;
+ }
if (sendFile < 0)
fatal_error("could not find file \"%s\": %s",
Magnus Hagander <magnus@hagander.net> writes:
Oh, right, at the very last loop. I've never seen it need more than 1 loop
so I didn't manage to hit that codepath :) But yeah, saving errno and
restoring it on the other side of pg_usleep is probably a good idea.
Right. On a larger scale --- I'm too uncaffeinated to figure out whether
this is the code path taken for the initially user-supplied file name.
But it would be unfriendly if when you fat-finger the command line
arguments, it takes 5 seconds for pg_xlogdump to tell you so. Maybe
the looping behavior should only happen on non-first file access.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Sep 26, 2016 at 3:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
Oh, right, at the very last loop. I've never seen it need more than 1
loop
so I didn't manage to hit that codepath :) But yeah, saving errno and
restoring it on the other side of pg_usleep is probably a good idea.Right. On a larger scale --- I'm too uncaffeinated to figure out whether
this is the code path taken for the initially user-supplied file name.
But it would be unfriendly if when you fat-finger the command line
arguments, it takes 5 seconds for pg_xlogdump to tell you so. Maybe
the looping behavior should only happen on non-first file access.
It's a different codepath, so the retry does not happen on that. Different
error message, even. ("could not open file" vs "could not find file").
I'll go commit this including a source code comment about why the loop is
there that was missing.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/