END_OF_RECOVERY shutdowns and ResetUnloggedRelations()
Hi,
Abhijit and I investigated a customer problem which has showed that crash recovery +
unlogged relations don't always work well together:
A condensed version of how crash recovery works is:
StartupXLOG()
{
...
if (ControlFile->state != DB_SHUTDOWNED)
InRecovery = true;
if (InRecovery)
{
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
....
/* perform crash recovery till the end of WAL */
...
CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
...
}
PreallocXlogFiles(EndOfLog);
if (InRecovery)
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
...
/* finish startup */
}
the second important part is:
CreateCheckPoint(flags)
{
...
if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
shutdown = true;
...
if (shutdown)
ControlFile->state = DB_SHUTDOWNED;
UpdateControlFile();
...
}
If you consider a crash due to ENOSPC the problem is that first crash
recovery is performed. Then a checkpoint is performed which is marked as
END_OF_RECOVERY - which marks the database as being properly shut down!
So far, while not pretty, so good. The problem is that if we crash after
the CreateCheckPoint(), e.g. because of xlog preallocation or the new
files created in ResetUnloggedRelations(), the server will restart *but*
will *not* perform crash recovery anymore as pg_control marked as
DB_SHUTDOWNED!
That leaves you with a database which has all the _init forks, but not
the main forks... Leading to scary an unexpected errors.
Should somebody google this: The problem can be fixed by forcing the
server into crash recovery again using an immediate shutdown.
Personally I think it's quite the mistake that END_OF_RECOVERY
checkpoints are treated as shutdown checkpoints. The claimed reason
that:
*
* Note that we write a shutdown checkpoint rather than an on-line
* one. This is not particularly critical, but since we may be
* assigning a new TLI, using a shutdown checkpoint allows us to have
* the rule that TLI only changes in shutdown checkpoints, which
* allows some extra error checking in xlog_redo.
*
and
/*
* An end-of-recovery checkpoint is really a shutdown checkpoint, just
* issued at a different time.
*/
isn't very convincing as those checks could just as well be saved in a
flags argument in the checkpoint. The likelihood of this confusion
causing further bugs (IIRC it already has caused a couple) seems high.
What I like even less is that pg_control is actually marked as
DB_SHUTDOWNED due to END_OF_RECOVERY. That's just plain wrong. Obviously
the database was *NOT* shutdown peacefully. I don't see active bugs due
it besides this, but I think it's likely to either have or create futher
ones.
Because at least the former is something that obviously we can't (and
don't want) to change in the back branches I think the solution for this
particular problem is to simply move the ResetUnloggedRelations() call a
couple lines up to before the CreateCheckPoint(). That should fix this.
Comments, other opinions?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
On 2014-09-12 13:22:46 +0200, Andres Freund wrote:
Because at least the former is something that obviously we can't (and
don't want) to change in the back branches I think the solution for this
particular problem is to simply move the ResetUnloggedRelations() call a
couple lines up to before the CreateCheckPoint(). That should fix this.Comments, other opinions?
A second thing I realized, just after hitting send, is that I think
ResetUnloggedRelations(UNLOGGED_RELATION_INIT) actually has to fsync the
created files. As we rely on the !init relations being there they really
need to be fsynced. During normal running that's ensured via the smgr
during checkpoints, but that's not the case here.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 12, 2014 at 7:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:
What I like even less is that pg_control is actually marked as
DB_SHUTDOWNED due to END_OF_RECOVERY. That's just plain wrong. Obviously
the database was *NOT* shutdown peacefully. I don't see active bugs due
it besides this, but I think it's likely to either have or create futher
ones.
I agree. The database clearly isn't shut down at end of recovery;
it's still active and we're still doing things to it. If we crash at
that point, we need to go back into recovery on restart. This seems
open and shut, but maybe I'm missing something. Why shouldn't we fix
*that*?
With regard to your second email, I agree that
ResetUnloggedRelation(UNLOGGED_RELATION_INIT) needs to issue fsyncs.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-09-12 14:44:48 -0400, Robert Haas wrote:
On Fri, Sep 12, 2014 at 7:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:
What I like even less is that pg_control is actually marked as
DB_SHUTDOWNED due to END_OF_RECOVERY. That's just plain wrong. Obviously
the database was *NOT* shutdown peacefully. I don't see active bugs due
it besides this, but I think it's likely to either have or create futher
ones.I agree. The database clearly isn't shut down at end of recovery;
it's still active and we're still doing things to it. If we crash at
that point, we need to go back into recovery on restart. This seems
open and shut, but maybe I'm missing something. Why shouldn't we fix
*that*?
Well, I think we might want to do both. There doesn't seem to be a
reason to *not* do the ResetUnloggedRelation(UNLOGGED_RELATION_INIT)
around the ShutdownWalRcv(). That seems much closer where it, for me,
logically belongs. And it'd fix the concrete problem.
For DB_SHUTDOWNED (is that actually a word? Looks like it could be from
me...) the case isn't that clear:
If you start a node after a crash and stop it as soon as it finished, it
doesn't need recovery again. Similar if a node is promoted and doesn't
use fast promotion or a older release. Now, I think this is a pretty
dubious benefit. But I'm not sure it's wise to change it in the back
branches.
With regard to your second email, I agree that
ResetUnloggedRelation(UNLOGGED_RELATION_INIT) needs to issue fsyncs.
Good.
The topic reminds me of the fact that I actually think at the very least
pg_xlog/ and pg_control needs the same treatment. Consider the following
sequence:
1) postgres writes loads of stuff. Lots of it to *not* fsynced WAL
segments
2) postgres crashes and crash recovery happens. Replays *not* fsynced
WAL
3) the OS crashes
4) Bad. We now might hava pg_control with a minRecovery that's *later*
than some potentially unsynced WAL segments
I think the easiest would be to just fsync() the entire data directory
at the start when ControlFile->state !=
DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY
Note that that's independent of the fsync for unlogged relations.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 18, 2014 at 4:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-09-12 14:44:48 -0400, Robert Haas wrote:
On Fri, Sep 12, 2014 at 7:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:
What I like even less is that pg_control is actually marked as
DB_SHUTDOWNED due to END_OF_RECOVERY. That's just plain wrong. Obviously
the database was *NOT* shutdown peacefully. I don't see active bugs due
it besides this, but I think it's likely to either have or create futher
ones.I agree. The database clearly isn't shut down at end of recovery;
it's still active and we're still doing things to it. If we crash at
that point, we need to go back into recovery on restart. This seems
open and shut, but maybe I'm missing something. Why shouldn't we fix
*that*?Well, I think we might want to do both. There doesn't seem to be a
reason to *not* do the ResetUnloggedRelation(UNLOGGED_RELATION_INIT)
around the ShutdownWalRcv(). That seems much closer where it, for me,
logically belongs. And it'd fix the concrete problem.
That might be all right. I've booted enough things in this area of
the code to not have a lot of confidence in my ability to not boot
things in this area of the code ... but I don't see a problem with it.
It does seem a little odd to do that before checking whether we
reached the minimum recovery point, or deciding whether to assign a
new TLI, but I don't see a concrete problem with putting it there.
With regard to your second email, I agree that
ResetUnloggedRelation(UNLOGGED_RELATION_INIT) needs to issue fsyncs.Good.
The topic reminds me of the fact that I actually think at the very least
pg_xlog/ and pg_control needs the same treatment. Consider the following
sequence:
1) postgres writes loads of stuff. Lots of it to *not* fsynced WAL
segments
2) postgres crashes and crash recovery happens. Replays *not* fsynced
WAL
3) the OS crashes
4) Bad. We now might hava pg_control with a minRecovery that's *later*
than some potentially unsynced WAL segmentsI think the easiest would be to just fsync() the entire data directory
at the start when ControlFile->state !=
DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERYNote that that's independent of the fsync for unlogged relations.
Seems reasonable.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi Andres, Robert.
I've attached four patches here.
1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to
earlier in StartupXLOG.
2. Inside that function, issue fsync()s for the main forks we create by
copying the _init fork.
3. A small fixup to add a const to "typedef char *FileName", because the
earlier patch gave me warnings about discarding const-ness. This is
consistent with many other functions in fd.c that take const char *.
4. Issue an fsync() on the data directory at startup if we need to
perform crash recovery.
I created some unlogged relations, performed an immediate shutdown, and
then straced postgres as it performed crash recovery. The changes in (2)
do indeed fsync the files we create by copying *_init, and don't fsync
anything else (at least not after I fixed a bug ;-).
I did not do anything about the END_OF_RECOVERY checkpoint setting the
ControlFile->state to DB_SHUTDOWNED, because it wasn't clear to me if
there was any agreement on what to do. I would be happy to submit a
followup patch if we reach some decision about it.
Is this what you had in mind?
-- Abhijit
Attachments:
0003-Add-const-to-FileName-typedef-make-fsync_fname-use-i.patchtext/x-diff; charset=us-asciiDownload
>From 2b7eb360cd8789bb75cc9c02b85b1df6a9903f0f Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Wed, 24 Sep 2014 16:07:06 +0530
Subject: Add const to FileName typedef, make fsync_fname use it
The filename arguments aren't modified anyway, and this avoids warnings
when fsync_fname is called from ResetUnloggedRelationsInDbspaceDir. It
is also consistent with several other functions in fd.c that already
take const char * arguments.
---
src/backend/storage/file/fd.c | 2 +-
src/include/storage/fd.h | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1f69c9e..33ab471 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -390,7 +390,7 @@ pg_flush_data(int fd, off_t offset, off_t amount)
* indicate the OS just doesn't allow/require fsyncing directories.
*/
void
-fsync_fname(char *fname, bool isdir)
+fsync_fname(FileName fname, bool isdir)
{
int fd;
int returncode;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index a6df8fb..2fa4e18 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -46,7 +46,7 @@
* FileSeek uses the standard UNIX lseek(2) flags.
*/
-typedef char *FileName;
+typedef const char *FileName;
typedef int File;
@@ -113,7 +113,7 @@ extern int pg_fsync_no_writethrough(int fd);
extern int pg_fsync_writethrough(int fd);
extern int pg_fdatasync(int fd);
extern int pg_flush_data(int fd, off_t offset, off_t amount);
-extern void fsync_fname(char *fname, bool isdir);
+extern void fsync_fname(FileName fname, bool isdir);
/* Filename components for OpenTemporaryFile */
#define PG_TEMP_FILES_DIR "pgsql_tmp"
--
1.9.1
0004-If-we-need-to-perform-crash-recovery-fsync-the-data-.patchtext/x-diff; charset=us-asciiDownload
>From bc07b596b8778258da661d7c9aea1f4c0d00a2e0 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Wed, 24 Sep 2014 16:20:43 +0530
Subject: If we need to perform crash recovery, fsync the data directory
This is so that we don't lose older unflushed writes in the event of a
power failure after crash recovery, where more recent writes are
preserved.
See 20140918083148.GA17265@alap3.anarazel.de for details.
---
src/backend/access/transam/xlog.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 218f7fb..95d57cb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5973,6 +5973,18 @@ StartupXLOG(void)
ereport(FATAL,
(errmsg("control file contains invalid data")));
+ /*
+ * If we need to perform crash recovery, we issue an fsync on the
+ * data directory to try to ensure that any data written before the
+ * crash are flushed to disk. Otherwise a power failure in the near
+ * future might mean that earlier unflushed writes are lost, but the
+ * more recent data written to disk from here on are persisted.
+ */
+
+ if (ControlFile->state != DB_SHUTDOWNED &&
+ ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+ fsync_fname(data_directory, true);
+
if (ControlFile->state == DB_SHUTDOWNED)
{
/* This is the expected case, so don't be chatty in standalone mode */
--
1.9.1
0001-Call-ResetUnloggedRelations-UNLOGGED_RELATION_INIT-e.patchtext/x-diff; charset=us-asciiDownload
>From d8726c06cdf11674661eac1d091cf7edd05c2a0c Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Wed, 24 Sep 2014 14:43:18 +0530
Subject: Call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier
We need to call this after recovery, but not after the END_OF_RECOVERY
checkpoint is written. If we crash after that checkpoint, for example
because of ENOSPC in PreallocXlogFiles, the checkpoint has already set
the ControlFile->state to DB_SHUTDOWNED, so we don't enter recovery
again at startup.
Because we did call ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP)
in the first cleanup, this leaves the database with a bunch of _init
forks for unlogged relations, but no main forks, leading to scary
errors.
See thread from 20140912112246.GA4984@alap3.anarazel.de for details.
---
src/backend/access/transam/xlog.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 46eef5f..218f7fb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6863,6 +6863,14 @@ StartupXLOG(void)
ShutdownWalRcv();
/*
+ * Reset initial contents of unlogged relations. This has to be done
+ * AFTER recovery is complete so that any unlogged relations created
+ * during recovery also get picked up.
+ */
+ if (InRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+
+ /*
* We don't need the latch anymore. It's not strictly necessary to disown
* it, but let's do it for the sake of tidiness.
*/
@@ -7130,14 +7138,6 @@ StartupXLOG(void)
PreallocXlogFiles(EndOfLog);
/*
- * Reset initial contents of unlogged relations. This has to be done
- * AFTER recovery is complete so that any unlogged relations created
- * during recovery also get picked up.
- */
- if (InRecovery)
- ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
-
- /*
* Okay, we're officially UP.
*/
InRecovery = false;
--
1.9.1
0002-Make-ResetUnloggedRelations-_INIT-fsync-newly-create.patchtext/x-diff; charset=us-asciiDownload
>From 7eba57b5ed9e1eda1fa1b14b32a82828617d823e Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Wed, 24 Sep 2014 16:01:37 +0530
Subject: =?UTF-8?q?Make=20ResetUnloggedRelations(=E2=80=A6=5FINIT)=20fsync?=
=?UTF-8?q?=20newly-created=20main=20forks?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
During UNLOGGED_RELATION_INIT, after we have copied ${x}_init to $x, we
issue fsync()s for the newly-created files. We depend on their existence
and a checkpoint isn't going to fsync them for us during recovery.
See thread from 20140912112246.GA4984@alap3.anarazel.de for details.
---
src/backend/storage/file/reinit.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 3229f41..4febf6f 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -339,6 +339,42 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
}
FreeDir(dbspace_dir);
+
+ /*
+ * copy_file() above has already called pg_flush_data() on the
+ * files it created. Now we need to fsync those files, because
+ * a checkpoint won't do it for us while we're in recovery.
+ */
+
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ ForkNumber forkNum;
+ int oidchars;
+ char oidbuf[OIDCHARS + 1];
+ char mainpath[MAXPGPATH];
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum))
+ continue;
+
+ /* Also skip it unless this is the init fork. */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+
+ /* Construct main fork pathname. */
+ memcpy(oidbuf, de->d_name, oidchars);
+ oidbuf[oidchars] = '\0';
+ snprintf(mainpath, sizeof(mainpath), "%s/%s%s",
+ dbspacedirname, oidbuf, de->d_name + oidchars + 1 +
+ strlen(forkNames[INIT_FORKNUM]));
+
+ fsync_fname(mainpath, false);
+ }
+ FreeDir(dbspace_dir);
+
+ fsync_fname(dbspacedirname, true);
}
}
--
1.9.1
Hi,
On 2014-09-24 17:06:05 +0530, Abhijit Menon-Sen wrote:
Hi Andres, Robert.
I've attached four patches here.
Cool. Will review.
1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to
earlier in StartupXLOG.2. Inside that function, issue fsync()s for the main forks we create by
copying the _init fork.
These two imo should definitely be backpatched.
3. A small fixup to add a const to "typedef char *FileName", because the
earlier patch gave me warnings about discarding const-ness. This is
consistent with many other functions in fd.c that take const char *.
I'm happy with consider that one for master (although I seem to recall having had
a patch for it rejected?), but I don't think we want to do that in the
backbranches.
4. Issue an fsync() on the data directory at startup if we need to
perform crash recovery.
I personally think this one should be backpatched too. Does anyone
believe differently?
From d8726c06cdf11674661eac1d091cf7edd05c2a0c Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Wed, 24 Sep 2014 14:43:18 +0530
Subject: Call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlierWe need to call this after recovery, but not after the END_OF_RECOVERY
checkpoint is written. If we crash after that checkpoint, for example
because of ENOSPC in PreallocXlogFiles, the checkpoint has already set
the ControlFile->state to DB_SHUTDOWNED, so we don't enter recovery
again at startup.Because we did call ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP)
in the first cleanup, this leaves the database with a bunch of _init
forks for unlogged relations, but no main forks, leading to scary
errors.See thread from 20140912112246.GA4984@alap3.anarazel.de for details.
With a explanation. Shiny!
---
src/backend/access/transam/xlog.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 46eef5f..218f7fb 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6863,6 +6863,14 @@ StartupXLOG(void) ShutdownWalRcv();/* + * Reset initial contents of unlogged relations. This has to be done + * AFTER recovery is complete so that any unlogged relations created + * during recovery also get picked up. + */ + if (InRecovery) + ResetUnloggedRelations(UNLOGGED_RELATION_INIT); +
+
* Also recovery shouldn't be regarded successful if the reset fails -
* e.g. because of ENOSPC.
+ + /* + * copy_file() above has already called pg_flush_data() on the + * files it created. Now we need to fsync those files, because + * a checkpoint won't do it for us while we're in recovery. + */
+
* Doing this in a separate pass is advantageous for performance reasons
* because it allows the kernel to perform all the flushes at once.
+ /* + * If we need to perform crash recovery, we issue an fsync on the + * data directory to try to ensure that any data written before the + * crash are flushed to disk. Otherwise a power failure in the near + * future might mean that earlier unflushed writes are lost, but the + * more recent data written to disk from here on are persisted. + */ + + if (ControlFile->state != DB_SHUTDOWNED && + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY) + fsync_fname(data_directory, true); + if (ControlFile->state == DB_SHUTDOWNED) { /* This is the expected case, so don't be chatty in standalone mode */ --
Unless I miss something this isn't sufficient. We need to fsync the
files in the data directory, not just the toplevel directory?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
At 2014-09-25 22:41:18 +0200, andres@2ndquadrant.com wrote:
* Also recovery shouldn't be regarded successful if the reset fails -
* e.g. because of ENOSPC.
OK.
* Doing this in a separate pass is advantageous for performance reasons
* because it allows the kernel to perform all the flushes at once.
OK.
Unless I miss something this isn't sufficient. We need to fsync the
files in the data directory, not just the toplevel directory?
No, of course you're right. So a separate function that does the moral
equivalent of "find $PGDATA -exec fsync_fname …"?
Will resubmit with the additional comments.
-- Abhijit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-09-26 02:34:06 +0530, Abhijit Menon-Sen wrote:
At 2014-09-25 22:41:18 +0200, andres@2ndquadrant.com wrote:
Unless I miss something this isn't sufficient. We need to fsync the
files in the data directory, not just the toplevel directory?No, of course you're right. So a separate function that does the moral
equivalent of "find $PGDATA -exec fsync_fname …"?
Probably will require some care to deal correctly with tablespaces.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
At 2014-09-25 22:41:18 +0200, andres@2ndquadrant.com wrote:
On 2014-09-24 17:06:05 +0530, Abhijit Menon-Sen wrote:
1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to
earlier in StartupXLOG.2. Inside that function, issue fsync()s for the main forks we create by
copying the _init fork.These two imo should definitely be backpatched.
I've attached updated versions of these two patches. The only changes
are to revise the comments as you suggested.
3. A small fixup to add a const to "typedef char *FileName", because the
earlier patch gave me warnings about discarding const-ness. This is
consistent with many other functions in fd.c that take const char *.I'm happy with consider that one for master (although I seem to recall
having had a patch for it rejected?), but I don't think we want to do
that in the backbranches.
(I gave up on this for now as an unrelated diversion.)
4. Issue an fsync() on the data directory at startup if we need to
perform crash recovery.I personally think this one should be backpatched too. Does anyone
believe differently?
I revised this patch as well, I'll rebase and send it separately along
with the initdb -S patch shortly.
-- Abhijit
Attachments:
0001-Call-ResetUnloggedRelations-UNLOGGED_RELATION_INIT-e.patchtext/x-diff; charset=us-asciiDownload
>From 8487185a5f2073ed475f971e6203d49e1e21a987 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Wed, 8 Oct 2014 11:48:25 +0530
Subject: Call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier
We need to call this after recovery, but not after the END_OF_RECOVERY
checkpoint is written. If we crash after that checkpoint, for example
because of ENOSPC in PreallocXlogFiles, the checkpoint has already set
the ControlFile->state to DB_SHUTDOWNED, so we don't enter recovery
again at startup.
Because we did call ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP)
in the first cleanup, this leaves the database with a bunch of _init
forks for unlogged relations, but no main forks, leading to scary
errors.
See thread from 20140912112246.GA4984@alap3.anarazel.de for details.
---
src/backend/access/transam/xlog.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3c9aeae..2ab9da5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6871,6 +6871,16 @@ StartupXLOG(void)
ShutdownWalRcv();
/*
+ * Reset unlogged relations to the contents of their INIT fork. This
+ * is done AFTER recovery is complete so as to include any unlogged
+ * relations created during recovery, but BEFORE recovery is marked
+ * as having completed successfully (because this step can fail,
+ * e.g. due to ENOSPC).
+ */
+ if (InRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+
+ /*
* We don't need the latch anymore. It's not strictly necessary to disown
* it, but let's do it for the sake of tidiness.
*/
@@ -7138,14 +7148,6 @@ StartupXLOG(void)
PreallocXlogFiles(EndOfLog);
/*
- * Reset initial contents of unlogged relations. This has to be done
- * AFTER recovery is complete so that any unlogged relations created
- * during recovery also get picked up.
- */
- if (InRecovery)
- ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
-
- /*
* Okay, we're officially UP.
*/
InRecovery = false;
--
1.9.1
0002-Make-ResetUnloggedRelations-_INIT-fsync-newly-create.patchtext/x-diff; charset=us-asciiDownload
>From 944e7c4e27fca5589b8a103f7f470df23a5161c2 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Wed, 24 Sep 2014 16:01:37 +0530
Subject: =?UTF-8?q?Make=20ResetUnloggedRelations(=E2=80=A6=5FINIT)=20fsync?=
=?UTF-8?q?=20newly-created=20main=20forks?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
During UNLOGGED_RELATION_INIT, after we have copied ${x}_init to $x, we
issue fsync()s for the newly-created files. We depend on their existence
and a checkpoint isn't going to fsync them for us during recovery.
See thread from 20140912112246.GA4984@alap3.anarazel.de for details.
---
src/backend/storage/file/reinit.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 3229f41..4ad5987 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -339,6 +339,44 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
}
FreeDir(dbspace_dir);
+
+ /*
+ * copy_file() above has already called pg_flush_data() on the
+ * files it created. Now we need to fsync those files, because
+ * a checkpoint won't do it for us while we're in recovery. We
+ * do this in a separate pass to allow the kernel to perform
+ * all the flushes at once.
+ */
+
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ ForkNumber forkNum;
+ int oidchars;
+ char oidbuf[OIDCHARS + 1];
+ char mainpath[MAXPGPATH];
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum))
+ continue;
+
+ /* Also skip it unless this is the init fork. */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+
+ /* Construct main fork pathname. */
+ memcpy(oidbuf, de->d_name, oidchars);
+ oidbuf[oidchars] = '\0';
+ snprintf(mainpath, sizeof(mainpath), "%s/%s%s",
+ dbspacedirname, oidbuf, de->d_name + oidchars + 1 +
+ strlen(forkNames[INIT_FORKNUM]));
+
+ fsync_fname(mainpath, false);
+ }
+ FreeDir(dbspace_dir);
+
+ fsync_fname((char *)dbspacedirname, true);
}
}
--
1.9.1
On 2014-10-27 16:09:30 +0530, Abhijit Menon-Sen wrote:
At 2014-09-25 22:41:18 +0200, andres@2ndquadrant.com wrote:
On 2014-09-24 17:06:05 +0530, Abhijit Menon-Sen wrote:
1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to
earlier in StartupXLOG.2. Inside that function, issue fsync()s for the main forks we create by
copying the _init fork.These two imo should definitely be backpatched.
I've attached updated versions of these two patches. The only changes
are to revise the comments as you suggested.
Committed and backpatched (to 9.1) these. That required also
backpatching the move of fsync_fname() to fd.c/h to 9.1-9.3.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers