initdb -S and tablespaces
Hi.
I just noticed that initdb -S ("Safely write all database files to disk
and exit") does (only) the following in perform_fsync:
pre_sync_fname(pdir, true);
walkdir(pg_data, pre_sync_fname);
fsync_fname(pdir, true);
walkdir(pg_data, fsync_fname);
walkdir() reads the directory and calls itself recursively for S_ISDIR
entries, or calls the function for S_ISREG entries… which means it
doesn't follow links.
Which means it doesn't fsync the contents of tablespaces.
-- Abhijit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-09-29 14:09:01 +0530, Abhijit Menon-Sen wrote:
Hi.
I just noticed that initdb -S ("Safely write all database files to disk
and exit") does (only) the following in perform_fsync:pre_sync_fname(pdir, true);
walkdir(pg_data, pre_sync_fname);fsync_fname(pdir, true);
walkdir(pg_data, fsync_fname);walkdir() reads the directory and calls itself recursively for S_ISDIR
entries, or calls the function for S_ISREG entries… which means it
doesn't follow links.Which means it doesn't fsync the contents of tablespaces.
Which means at least pg_upgrade is unsafe right
now... c.f. 630cd14426dc1daf85163ad417f3a224eb4ac7b0.
Note that the perform_fsync() *was* ok for its original purpose in
initdb. At the end of initdb there's no relevant tablespaces. But if
used *after* pg_upgrade, that's not necessarily the case.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
At 2014-09-29 11:54:10 +0200, andres@2ndquadrant.com wrote:
Note that the perform_fsync() *was* ok for its original purpose in
initdb. At the end of initdb there's no relevant tablespaces. But if
used *after* pg_upgrade, that's not necessarily the case.
Right.
So, since I'm writing a function to fsync everything inside PGDATA
anyway, it makes sense to call it both from initdb and StartupXLOG.
It'll do what initdb -S now does, plus follow links in pg_tblspc.
Any suggestions about where to put such a function? (I was looking at
backend/utils/init, but I'm not sure that's a good place for this.)
-- Abhijit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-09-29 15:43:32 +0530, Abhijit Menon-Sen wrote:
At 2014-09-29 11:54:10 +0200, andres@2ndquadrant.com wrote:
Note that the perform_fsync() *was* ok for its original purpose in
initdb. At the end of initdb there's no relevant tablespaces. But if
used *after* pg_upgrade, that's not necessarily the case.Right.
So, since I'm writing a function to fsync everything inside PGDATA
anyway, it makes sense to call it both from initdb and StartupXLOG.
It'll do what initdb -S now does, plus follow links in pg_tblspc.Any suggestions about where to put such a function? (I was looking at
backend/utils/init, but I'm not sure that's a good place for this.)
That can't work unfortunately. Both frontend and backend code need to
execute it... I'm not sure it's realistic to handle both cases the
same. The error handling, opening files/directories, and all will be
different. It'll also make backpatching hard :(.
So I'm afraid at least in a first patch it'll need to be a bit of
duplication. Fixing initdb's code back to 9.3 and the backend all the
way back to 9.0.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
At 2014-09-29 12:59:09 +0200, andres@2ndquadrant.com wrote:
So I'm afraid at least in a first patch it'll need to be a bit of
duplication. Fixing initdb's code back to 9.3 and the backend all
the way back to 9.0.
OK, thanks, I'll submit two separate patches then.
-- Abhijit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
At 2014-09-29 11:54:10 +0200, andres@2ndquadrant.com wrote:
On 2014-09-29 14:09:01 +0530, Abhijit Menon-Sen wrote:
I just noticed that initdb -S ("Safely write all database files to disk
and exit") does (only) the following in perform_fsync:pre_sync_fname(pdir, true);
walkdir(pg_data, pre_sync_fname);fsync_fname(pdir, true);
walkdir(pg_data, fsync_fname);walkdir() reads the directory and calls itself recursively for S_ISDIR
entries, or calls the function for S_ISREG entries… which means it
doesn't follow links.Which means it doesn't fsync the contents of tablespaces.
Which means at least pg_upgrade is unsafe right
now... c.f. 630cd14426dc1daf85163ad417f3a224eb4ac7b0.
Here's a proposed patch to initdb to make initdb -S fsync everything
under pg_tblspc. It introduces a new function that calls walkdir on
every entry under pg_tblspc. This is only one approach: I could have
also changed walkdir to follow links, but that would have required a
bunch of #ifdefs for Windows (because it doesn't have symlinks), and I
guessed a separate function with two calls might be easier to patch into
back branches. I've tested this patch under various conditions on Linux,
but it could use some testing on Windows.
-- Abhijit
Attachments:
tblspclinks.difftext/x-diff; charset=us-asciiDownload+56-0
At 2014-10-30 14:30:27 +0530, ams@2ndQuadrant.com wrote:
Here's a proposed patch to initdb to make initdb -S fsync everything
under pg_tblspc.
Oops, I meant to include the corresponding patch to xlog.c to do the
same at startup. It's based on the initdb patch, but modified to not
use fprintf/exit_nicely and so on. (Note that this was written in a
single chunk to aid backpatching. There's no attempt made to share
code in this set of patches.)
Now attached.
-- Abhijit
Attachments:
0001-If-we-need-to-perform-crash-recovery-fsync-PGDATA-re.patchtext/x-diff; charset=us-asciiDownload+184-1
On 2014-10-30 14:30:28 +0530, Abhijit Menon-Sen wrote:
Here's a proposed patch to initdb to make initdb -S fsync everything
under pg_tblspc. It introduces a new function that calls walkdir on
every entry under pg_tblspc. This is only one approach: I could have
also changed walkdir to follow links, but that would have required a
bunch of #ifdefs for Windows (because it doesn't have symlinks), and I
guessed a separate function with two calls might be easier to patch into
back branches. I've tested this patch under various conditions on Linux,
but it could use some testing on Windows.
I've pushed this. The windows buildfarm animals that run pg_upgrade (and
thus --sync-only) will have to tell us whether there's a problem. I sure
hope there's none...
Thanks for that patch!
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
On 2014-11-06 17:56:53 +0530, Abhijit Menon-Sen wrote:
+ /* + * If we need to perform crash recovery, we issue an fsync on the + * data directory and its contents to try to ensure that any data + * written before the crash are flushed to disk. Otherwise a power + * failure in the near future might cause earlier unflushed writes + * to be lost, even though more recent data written to disk from + * here on would be persisted. + */ + + if (ControlFile->state != DB_SHUTDOWNED && + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY) + perform_fsync(data_directory); +
a) Please think of a slightly more descriptive name than perform_fsync
b) I think we should check here whether fsync is enabled.
c) I'm wondering if we should add fsync to the control file and also
perform an fsync if the last shutdown was clear, but fsync was
disabled.
if (ControlFile->state == DB_SHUTDOWNED) { /* This is the expected case, so don't be chatty in standalone mode */ @@ -11262,3 +11281,168 @@ SetWalWriterSleeping(bool sleeping) XLogCtl->WalWriterSleeping = sleeping; SpinLockRelease(&XLogCtl->info_lck); } + +/* + * Hint to the OS that it should get ready to fsync() this file. + * + * Adapted from pre_sync_fname in initdb.c + */ +static void +pre_sync_fname(char *fname, bool isdir) +{
this essentially already exists in the backend inparts. C.f. pg_flush_data.
+/* + * walkdir: recursively walk a directory, applying the action to each + * regular file and directory (including the named directory itself). + * + * Adapted from copydir() in copydir.c. + */ +static void +walkdir(char *path, void (*action) (char *fname, bool isdir)) +{ + DIR *dir; + struct dirent *de; + + dir = AllocateDir(path); + while ((de = ReadDir(dir, path)) != NULL) + { + char subpath[MAXPGPATH]; + struct stat fst; + + CHECK_FOR_INTERRUPTS(); + + if (strcmp(de->d_name, ".") == 0 || + strcmp(de->d_name, "..") == 0) + continue; + + snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name); + + if (lstat(subpath, &fst) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not stat file \"%s\": %m", subpath))); + + if (S_ISDIR(fst.st_mode)) + walkdir(subpath, action); + else if (S_ISREG(fst.st_mode)) + (*action) (subpath, false);
Theoretically you should also invoke fsync on directories.
+/* + * Issue fsync recursively on PGDATA and all its contents, including the + * links under pg_tblspc. + * + * Adapted from perform_fsync in initdb.c + */ +static void +perform_fsync(char *pg_data) +{ + char pdir[MAXPGPATH]; + char pg_tblspc[MAXPGPATH]; + + /* + * We need to name the parent of PGDATA. get_parent_directory() isn't + * enough here, because it can result in an empty string. + */ + snprintf(pdir, MAXPGPATH, "%s/..", pg_data); + canonicalize_path(pdir);
Hm. Why is this neded? Just syncing . should work?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
At 2015-01-14 11:59:08 +0100, andres@2ndquadrant.com wrote:
+ if (ControlFile->state != DB_SHUTDOWNED && + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY) + perform_fsync(data_directory); +a) Please think of a slightly more descriptive name than perform_fsync
OK. (I just copied the name initdb uses, because at the time I was still
thinking in terms of a later patch moving this to src/common.) What do
you think of fsync_recursively? fsync_pgdata?
I think fsync_recursively(data_directory) reads well.
b) I think we should check here whether fsync is enabled.
OK, will do.
c) I'm wondering if we should add fsync to the control file and also
perform an fsync if the last shutdown was clear, but fsync was
disabled.
Explain? "Add fsync to the control file" means store the value of the
fsync GUC setting in the control file? And would the fsync you mention
be dependent on the setting, or unconditional?
+static void +pre_sync_fname(char *fname, bool isdir) +{this essentially already exists in the backend inparts. C.f.
pg_flush_data.
OK, I missed that. I'll rework the patch to use it.
Theoretically you should also invoke fsync on directories.
OK.
+ * We need to name the parent of PGDATA. get_parent_directory() isn't + * enough here, because it can result in an empty string. + */ + snprintf(pdir, MAXPGPATH, "%s/..", pg_data); + canonicalize_path(pdir);Hm. Why is this neded? Just syncing . should work?
Not sure, will investigate.
Thanks.
-- Abhijit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-01-15 11:02:43 +0530, Abhijit Menon-Sen wrote:
At 2015-01-14 11:59:08 +0100, andres@2ndquadrant.com wrote:
+ if (ControlFile->state != DB_SHUTDOWNED && + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY) + perform_fsync(data_directory); +a) Please think of a slightly more descriptive name than perform_fsync
OK. (I just copied the name initdb uses, because at the time I was still
thinking in terms of a later patch moving this to src/common.) What do
you think of fsync_recursively? fsync_pgdata?
I like fsync_pgdata/datadir or something.
Note that I think you'll have to check/handle pg_xlog being a symlink -
we explicitly support that as a usecase...
c) I'm wondering if we should add fsync to the control file and also
perform an fsync if the last shutdown was clear, but fsync was
disabled.Explain? "Add fsync to the control file" means store the value of the
fsync GUC setting in the control file?
Yes.
And would the fsync you mention be dependent on the setting, or unconditional?
What I am thinking of is that, currently, if you start the server for
initial loading with fsync=off, and then restart it, you're open to data
loss. So when the current config file setting is changed from off to on,
we should fsync the data directory. Even if there was no crash restart.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
At 2015-01-15 14:32:45 +0100, andres@2ndquadrant.com wrote:
What I am thinking of is that, currently, if you start the server for
initial loading with fsync=off, and then restart it, you're open to
data loss. So when the current config file setting is changed from off
to on, we should fsync the data directory. Even if there was no crash
restart.
Patch attached.
Changes:
1. Renamed perform_fsync to fsync_recursively (otherwise it would read
"fsync_pgdata(pg_data)")
2. Added ControlData->fsync_disabled to record whether fsync was ever
disabled while the server was running (tested in CreateCheckPoint)
3. Run fsync_recursively at startup only if fsync is enabled
4. Run it if we're doing crash recovery, or fsync was disabled
5. Use pg_flush_data in pre_sync_fname
6. Issue fsync on directories too
7. Tested that it works if pg_xlog is a symlink (no changes).
(In short, everything you mentioned in your earlier mail.)
Note that I set ControlData->fsync_disabled to false in BootstrapXLOG,
but it gets set to true during a later CreateCheckPoint(). This means
we run fsync again at startup after initdb. I'm not sure what to do
about that.
Is this about what you had in mind?
-- Abhijit
Attachments:
0001-Recursively-fsync-PGDATA-at-startup-if-needed.patchtext/x-diff; charset=us-asciiDownload+180-2
Abhijit Menon-Sen wrote:
At 2015-01-15 14:32:45 +0100, andres@2ndquadrant.com wrote:
Patch attached.
Changes:
1. Renamed perform_fsync to fsync_recursively (otherwise it would read
"fsync_pgdata(pg_data)")
Okay, but as far as I can tell this function is very specific to
PGDATA; you couldn't use it in any other directory (or pg_tblspc would
be missing and that would cause an error, right?) Therefore I think it
would make sense to have the name reflect this; maybe
fsync_datadir_recursively(data_directory)
or
fsync_pgdata_recursively(data_directory)
would work? But then, since the name is already telling us that it's
all about PGDATA, maybe we can remove the "recursively" part of the
name? Not sure about any of this; other opinions?
I also noticed that walkdir() thinks it is completely agnostic on what
the passed action is; except that the comment at the bottom talks about
how fsync on directories is important, which seems out of place.
I wonder about walktblspc_links() too. Seems to me that that function
is pretty much the same as walkdir(); would it work to add a flag to the
latter to change the behavior in whatever way needs to be changed, and
remove the former? Hmm ... Actually, since surely we must follow
symlinks everywhere, why do we have to do this separately for pg_tblspc?
Shouldn't that link-following occur automatically when walking PGDATA in
the first place?
2. Added ControlData->fsync_disabled to record whether fsync was ever
disabled while the server was running (tested in CreateCheckPoint)
3. Run fsync_recursively at startup only if fsync is enabled
4. Run it if we're doing crash recovery, or fsync was disabled
5. Use pg_flush_data in pre_sync_fname
6. Issue fsync on directories too
7. Tested that it works if pg_xlog is a symlink (no changes).(In short, everything you mentioned in your earlier mail.)
Note that I set ControlData->fsync_disabled to false in BootstrapXLOG,
but it gets set to true during a later CreateCheckPoint(). This means
we run fsync again at startup after initdb. I'm not sure what to do
about that.
This all looks reasonable to me. I just noticed, though, that
the fd.c routines test enableFsync and do nothing if it's not enabled;
but fsync_recursively goes to all the trouble of doing stuff even if
disabled, and the actions are skipped later; the enableFsync check is
then responsibility of the caller. This seems a bit prone to later
confusion. Maybe fsync_recursively should Assert() that it's only being
called with enableFsync=on; or perhaps we can have it return early if
it's unset.
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi �lvaro.
Thanks for taking a look at the patch.
At 2015-04-03 13:32:32 -0300, alvherre@2ndquadrant.com wrote:
But then, since the name is already telling us that it's all about
PGDATA, maybe we can remove the "recursively" part of the name?
Sure, that makes sense too. Since you and Andres both like
�fsync_pgdata(data_directory)�, I'll change it accordingly.
I also noticed that walkdir() thinks it is completely agnostic on
what the passed action is; except that the comment at the bottom talks
about how fsync on directories is important, which seems out of place.
Yes. The function behaves as documented, but the comment is clearly too
specific. I'm not sure where to put it. I could make walkdir() NOT do
it, and instead do it in the caller with the same comment. Thoughts?
Hmm ... Actually, since surely we must follow symlinks everywhere,
why do we have to do this separately for pg_tblspc? Shouldn't that
link-following occur automatically when walking PGDATA in the first
place?
I'm not sure about that (and that's why I've not attached an updated
patch here). The original idea was to follow only those links that we
expect to be in PGDATA.
I suppose it would be easier in terms of the code to follow all links,
but I don't know if it's the right thing. If that's what you think we
should do, I can post a simplified patch.
Maybe fsync_recursively should Assert() that it's only being called
with enableFsync=on; or perhaps we can have it return early if it's
unset.
I prefer the latter. Will change.
-- Abhijit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Abhijit Menon-Sen wrote:
Hi,
At 2015-04-03 13:32:32 -0300, alvherre@2ndquadrant.com wrote:
I also noticed that walkdir() thinks it is completely agnostic on
what the passed action is; except that the comment at the bottom talks
about how fsync on directories is important, which seems out of place.Yes. The function behaves as documented, but the comment is clearly too
specific. I'm not sure where to put it. I could make walkdir() NOT do
it, and instead do it in the caller with the same comment. Thoughts?
I think it's enough to state in the function comment that the action is
applied to the top element too. Maybe add the fsync comment on the
callsite.
Hmm ... Actually, since surely we must follow symlinks everywhere,
why do we have to do this separately for pg_tblspc? Shouldn't that
link-following occur automatically when walking PGDATA in the first
place?I'm not sure about that (and that's why I've not attached an updated
patch here). The original idea was to follow only those links that we
expect to be in PGDATA.I suppose it would be easier in terms of the code to follow all links,
but I don't know if it's the right thing. If that's what you think we
should do, I can post a simplified patch.
Well, we have many things that can be set as symlinks; some you can have
initdb create the links for you (initdb --xlogdir), others you can move
manually. I think not following all links might lead to impossible-to-
detect bugs such as failing to fsync new pgdata subdirectories we add in
the future, for example.
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
At 2015-04-06 10:16:36 -0300, alvherre@2ndquadrant.com wrote:
Well, we have many things that can be set as symlinks; some you can
have initdb create the links for you (initdb --xlogdir), others you
can move manually.
Looking at sendDir() in backend/replication/basebackup.c, I notice that
the only places where it expects/allows symlinks are for pg_xlog and in
pg_tblspc.
Still, thanks to the example in basebackup.c, I've got most of a patch
that should follow any symlinks in PGDATA. I just need to test a little
more before I post it.
-- Abhijit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
At 2015-04-15 11:40:34 +0530, ams@2ndQuadrant.com wrote:
Still, thanks to the example in basebackup.c, I've got most of a patch
that should follow any symlinks in PGDATA.
I notice that src/bin/pg_rewind/copy_fetch.c has a traverse_datadir()
function that does what we want (but it recurses into symlinks only
inside pg_tblspc).
-- Abhijit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi.
Here's a variation of the earlier patch that follows all links in
PGDATA. Does this look more like what you had in mind?
-- Abhijit
Attachments:
0001-20150416-Recursively-fsync-PGDATA-at-startup-if-needed.patchtext/x-diff; charset=us-asciiDownload+177-2
On Thu, Apr 16, 2015 at 9:24 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
Here's a variation of the earlier patch that follows all links in
PGDATA. Does this look more like what you had in mind?
I'm really confused by the additional control-file field. It is
documented as indicating whether fsync was ever disabled while the
server was running. But:
1. It doesn't do that. As soon as we fsync the data directory, we
reset the flag. That's not what "ever disabled" means to me.
2. I don't know why it's part of this patch. Tracking whether fsync
was ever disabled could be useful forensically, but isn't related to
fsync-ing the data directory after a crash, so I dunno why we'd put
that in this patch. Tracking whether fsync was disabled recently, as
the patch actually does, doesn't seem to be of any obvious value in
preventing corruption either.
Also, it seems awfully unfortunate to me that we're duplicating a
whole pile of code into xlog.c here. Maybe there's no way to avoid
the code duplication, but pre_sync_fname() seems like it'd more
naturally go in fd.c than here. I dunno where walkdir should go, but
again, not in xlog.c.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas wrote:
Also, it seems awfully unfortunate to me that we're duplicating a
whole pile of code into xlog.c here. Maybe there's no way to avoid
the code duplication, but pre_sync_fname() seems like it'd more
naturally go in fd.c than here. I dunno where walkdir should go, but
again, not in xlog.c.
Hm, there's an interest in backpatching this as a bugfix, if I
understand correctly; hence the duplicated code. We could remove the
duplicity later with a refactoring patch in master only.
However, now that you mention a pg_control flag, it becomes evident to
me that a change to pg_control cannot be back-patched ...
--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers