Potential data loss of 2PC files
Hi all,
2PC files are created using RecreateTwoPhaseFile() in two places currently:
- at replay of an XLOG_XACT_PREPARE record.
- at checkpoint with CheckPointTwoPhase().
Now RecreateTwoPhaseFile() is careful to call pg_fsync() to be sure
that the 2PC files find their way to disk. But one piece is missing:
the parent directory pg_twophase is not fsync'd. This is most
sensitive at replay, when a PREPARE record is followed by a checkpoint
record: if there is a power failure after the checkpoint completes,
there is a risk of losing 2PC status files.
It seems to me that we really should have CheckPointTwoPhase() call
fsync() on pg_twophase to be sure that no files are lost. There is no
point in doing this in RecreateTwoPhaseFile(): if there are many 2PC
transactions to replay, performance would suffer, and we don't care
about the durability of those files until a checkpoint moves the redo
pointer. I have drafted the attached patch to address this issue.
I am adding that as well to the next CF for consideration.
Thoughts?
--
Michael
Attachments:
2pc-loss.patch
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5415604993..192ee065d5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1651,6 +1651,16 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
"for long-running prepared transactions",
serialized_xacts,
serialized_xacts)));
+
+ /*
+ * Make sure that the content created is persistent on disk to prevent
+ * data loss in case of an OS crash or a power failure. Each 2PC file
+ * has been already created and flushed to disk individually by
+ * RecreateTwoPhaseFile() using the list of GXACTs available for
+ * normal processing as well as at recovery when replaying individually
+ * each XLOG_XACT_PREPARE record.
+ */
+ fsync_fname(TWOPHASE_DIR, true);
}
/*
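To illustrate the failure mode being fixed here: a file's contents can
be durable while its directory entry is not. A minimal standalone
sketch of the full pattern the fix relies on (file name and contents
are purely illustrative, and error handling is kept to a bare minimum):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* Create and fill a state file, then make its contents durable. */
	int fd = open("pg_twophase/00000A2C", O_CREAT | O_WRONLY | O_TRUNC, 0600);
	if (fd < 0)
		return 1;
	if (write(fd, "2PC state", 9) != 9 || fsync(fd) != 0)
		return 1;
	close(fd);

	/*
	 * Make the directory entry durable as well.  Without this step, a
	 * power failure can leave the file's blocks written but the file
	 * itself absent from pg_twophase, which is exactly the loss
	 * described above.
	 */
	int dirfd = open("pg_twophase", O_RDONLY);
	if (dirfd < 0 || fsync(dirfd) != 0)
		return 1;
	close(dirfd);
	return 0;
}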
On Wed, Dec 21, 2016 at 8:30 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
[...]
It's pretty stupid that operating systems don't guarantee that, when
you fsync() a file, the file will also be visible in its parent
directory. There's not much of a use case for wanting the file to have
the correct contents while not caring whether it's there at all.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On December 22, 2016 5:50:38 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
[...]
It makes more sense if you mentally separate filename(s) and file contents. Having to do filesystem metadata transactions for an fsync intended to sync contents would be annoying...
On Thu, Dec 22, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de> wrote:
It makes more sense if you mentally separate filename(s) and file contents. Having to do filesystem metadata transactions for an fsync intended to sync contents would be annoying...
I thought that's why there's fdatasync.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On December 22, 2016 6:44:22 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 22, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de> wrote:
[...]
I thought that's why there's fdatasync.
Not quite, IIRC: that doesn't deal with file size increases. All this would be easier if hardlinks didn't exist, IIUC. It's basically a question of whether the dentry, inode, or contents need to be synced. Yes, it sucks.
Andres
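To unpack that distinction in code terms, a rough illustrative mapping
(exact behavior varies by filesystem and platform, so treat this as a
sketch rather than a guarantee; the paths are made up):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("somefile", O_CREAT | O_WRONLY, 0600);
	int dirfd = open(".", O_RDONLY);

	if (fd < 0 || dirfd < 0)
		return 1;
	fdatasync(fd);	/* contents, plus the metadata needed to read them */
	fsync(fd);		/* contents plus the remaining inode metadata */
	fsync(dirfd);	/* the dentry: the file's name within "." */
	close(fd);
	close(dirfd);
	return 0;
}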
On 12/22/16 12:02 PM, Andres Freund wrote:
[...]
IIRC this isn't the first time we've run into this problem... Should
pg_fsync() automatically fsync the directory as well? I guess we'd need
a flag to disable that for performance-critical areas where we know we
don't need it (presumably just certain WAL fsyncs).
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
On Fri, Dec 23, 2016 at 6:33 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
[...]
I am not sure that would be wise performance-wise. The case of the 2PC
files is quite special anyway, as just doing the sync for everything at
checkpoint time would be enough.
--
Michael
On Fri, Dec 23, 2016 at 3:02 AM, Andres Freund <andres@anarazel.de> wrote:
[...]
I did more digging in the code... Looking at unlogged tables and the
empty() routines of access methods, isn't there a risk as well for
unlogged tables? mdimmedsync() does not fsync() the parent directory
either! It seems to me that we want to fsync() the parent folder in
this code path especially, and not just at checkpoint, as this code
path assumes that it does not care about shared buffers. We could do
that at checkpoint as well, actually, by looping through all the
tablespaces and fsync'ing the database directories.
Robert, as the former author of unlogged tables, and Andres, as you
have done a lot of work on durability, could you share your opinions on
the matter? Of course, opinions of others are welcome!
--
Michael
On 2016-12-27 14:09:05 +0900, Michael Paquier wrote:
[...]
mdimmedsync() does not fsync() the parent directory either!
But the files aren't created there, so I don't generally see the
problem. And the creation of the metapages etc. should be WAL logged
anyway. So I don't think we should / need to do anything special
there. You can argue however that we wouldn't necessarily fsync the
parent directory for the file creation, ever. But that'd be more
smgrcreate's responsibility than anything.
We could do that at checkpoint as well, actually, by looping through
all the tablespaces and fsync the database directories.
That seems like a bad idea, as it'd force fsyncs on unlogged tables and
such.
Regards,
Andres
On Thu, Dec 22, 2016 at 7:00 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
[...]
I agree with this.
If no prepared transactions were required to be fsynced by
CheckPointTwoPhase(), do we still want to fsync the directory?
Probably not.
Maybe you want to call fsync_fname(TWOPHASE_DIR, true) only if
serialized_xacts > 0.
The comment just states what has been done earlier in the function,
which is documented in the prologue as well.
/*
* Make sure that the content created is persistent on disk to prevent
* data loss in case of an OS crash or a power failure. Each 2PC file
* has been already created and flushed to disk individually by
* RecreateTwoPhaseFile() using the list of GXACTs available for
* normal processing as well as at recovery when replaying individually
* each XLOG_XACT_PREPARE record.
*/
Instead, you may want to just say "If we have flushed any 2PC files,
flush the metadata in the pg_twophase directory."
Although I thought that the simple case of creating a persistent table,
which requires creating a file, would also need to flush the directory.
I tried to locate the corresponding code to see whether it also uses
fsync_fname(), but I couldn't find it. I have looked at mdcreate(),
mdpreckpt() and mdpostckpt(), but maybe that's not relevant here.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Thu, Dec 29, 2016 at 6:41 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
I agree with this.
If no prepared transactions were required to be fsynced by
CheckPointTwoPhase(), do we still want to fsync the directory?
Probably not.
Maybe you want to call fsync_fname(TWOPHASE_DIR, true) only if
serialized_xacts > 0.
Definitely true for the non-recovery code path, but not for restart
points, as there is no GXACT entry created by the redo routine of a 2PC
prepare. We could have a static counter tracking how many 2PC files
have been flushed since the last restart point, but I am not convinced
that's worth the facility.
[...]
Instead, you may want to just say "If we have flushed any 2PC files,
flush the metadata in the pg_twophase directory."
That's not true for recovery. So I could go for something like this:
"If any 2PC files have been flushed, do the same for the parent
directory to make this information durable on disk. On recovery, issue
the fsync() anyway: individual 2PC files have already been flushed when
replaying their respective XLOG_XACT_PREPARE records."
[...]
I have been playing with creating unlogged tables, followed by
checkpoints and then unplugging the VM, but I could not see those files
going away. That was on ext4; on something like ext3, though, things
can disappear if the fsync() on the parent directory is forgotten.
--
Michael
On Fri, Dec 30, 2016 at 11:22 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
[...]
Definitely true for the non-recovery code path, but not for restart
points, as there is no GXACT entry created by the redo routine of a 2PC
prepare. We could have a static counter tracking how many 2PC files
have been flushed since the last restart point, but I am not convinced
that's worth the facility.
As per the prologue of the function, it doesn't expect any 2PC files to
be written out in this function, i.e. between two checkpoints: most of
them are created and deleted between two checkpoints. The same would be
true for recovery as well. Thus, in most cases we shouldn't need to
flush the two-phase directory in this function, whether during normal
operation or during recovery, so we should avoid flushing repeatedly
when it's not needed. I agree that serialized_xacts > 0 is not the
right condition during recovery on a standby for flushing the two-phase
directory.
During crash recovery, 2PC files are present on disk, which means the
two-phase directory has a correct record of them. This record cannot
change, so we shouldn't need to flush it again. If that's true,
serialized_xacts will be 0 during recovery, and thus the
serialized_xacts > 0 condition will still hold.
On a standby, however, we will have to flush the two-phase directory as
part of a checkpoint if there were any files left behind in that
directory. We need a different condition there.
That's not true for recovery. So I could go for something like this:
"If any 2PC files have been flushed, do the same for the parent
directory to make this information durable on disk. On recovery, issue
the fsync() anyway: individual 2PC files have already been flushed when
replaying their respective XLOG_XACT_PREPARE records."
We need to specify recovery (log replay) on standby specifically, I guess.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Fri, Dec 30, 2016 at 5:20 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
[...]
This is assuming that 2PC transactions are not long-lived, which is
likely true for anything doing sharding, like XC, XL or Citus (?). So
yes, it's fair to expect that.
[...]
On a standby, however, we will have to flush the two-phase directory as
part of a checkpoint if there were any files left behind in that
directory. We need a different condition there.
Well, flushing the metadata of pg_twophase is really going to be far
cheaper than the many page writes done before CheckPointTwoPhase() is
reached. There should really be a check on serialized_xacts for the
non-recovery code path, but considering how cheap the fsync is going to
be compared to the rest of the restart point work, it is not worth the
complexity of adding a counter, for example in shared memory within
XLogCtl (the counter gets reinitialized at each checkpoint, incremented
when replaying a 2PC prepare, decremented with a 2PC commit). So, to
reduce the backpatch impact, I would just do the fsync
if (serialized_xacts > 0 || RecoveryInProgress()) and call it a day.
--
Michael
[...]
Sounds ok to me. I will leave it to the committer to decide further.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Fri, Dec 30, 2016 at 10:59 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
[...]
Sounds ok to me. I will leave it to the committer to decide further.
Attached is an updated patch. I also noticed that it is better to do
the fsync call before the dtrace probe marking the end of the
checkpoint, as well as before the logging.
--
Michael
Attachments:
2pc-loss-v2.patch
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5415604993..7a2e32e265 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1641,6 +1641,15 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
}
LWLockRelease(TwoPhaseStateLock);
+ /*
+ * If any 2PC files have been flushed, do the same for the parent
+ * directory to make this information durable on disk. On recovery,
+ * issue the fsync() anyway, individual 2PC files have already been
+ * flushed when replaying their respective XLOG_XACT_PREPARE record.
+ */
+ if (serialized_xacts > 0 || RecoveryInProgress())
+ fsync_fname(TWOPHASE_DIR, true);
+
TRACE_POSTGRESQL_TWOPHASE_CHECKPOINT_DONE();
if (log_checkpoints && serialized_xacts > 0)
On Sat, Dec 31, 2016 at 5:53 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
[...]
Attached is an updated patch.
The patch looks good to me.
I am wondering what happens if a 2PC file gets created, we flush the
pg_twophase directory at checkpoint, and then the file gets removed. Do
we need to flush the directory again to ensure that the removal
persists? Whatever material I find about fsync() on a directory gives
examples of file creation, not of file deletion. If we want to persist
the removal, we probably have to either always flush pg_twophase or add
code to track whether any activity happened in pg_twophase between two
checkpoints. The latter seems a complication not worth the benefit.
I guess it's hard to construct a case reproducing the issue described
in your first mail, but I am still checking whether you have any
reproduction. Maybe we could use a similar reproduction to test the
deletion of a two-phase file.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Tue, Jan 3, 2017 at 3:32 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
[...]
There is already the delay checkpoint machinery to cover timing
problems here. Have a look for example at EndPrepare()@twophase.c.
[...]
I am still checking whether you have any reproduction. Maybe we could
use a similar reproduction to test the deletion of a two-phase file.
Not really. You can just do the test on a VM (one transaction
generating a 2PC file, followed by a checkpoint), then kill -9 the
parent instance. That's a radical way to emulate a power loss. I do
that on macOS with VMware Fusion.
--
Michael
On Tue, Jan 3, 2017 at 2:50 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
[...]
There is already the delay checkpoint machinery to cover timing
problems here. Have a look for example at EndPrepare()@twophase.c.
Are you talking about
/*
* Now we can mark ourselves as out of the commit critical section: a
* checkpoint starting after this will certainly see the gxact as a
* candidate for fsyncing.
*/
MyPgXact->delayChkpt = false;
That's while creating the file. I do not see similar code in
FinishPreparedTransaction() where the 2PC file is removed.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Tue, Jan 3, 2017 at 8:41 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
[...]
That's while creating the file. I do not see similar code in
FinishPreparedTransaction() where the 2PC file is removed.
RecordTransactionCommitPrepared() does the same work for COMMIT PREPARED.
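To make the pattern concrete for readers: a heavily simplified, purely
illustrative mock of the checkpoint-delay dance referenced above. The
real code lives in twophase.c and flips MyPgXact->delayChkpt inside a
critical section; the type below is a stand-in, not PostgreSQL code.

#include <stdbool.h>

/* Stand-in for the shared-memory per-backend transaction state. */
typedef struct PgXactStub { volatile bool delayChkpt; } PgXactStub;
static PgXactStub MyPgXactStub;

static void prepare_or_commit_prepared_stub(void)
{
	/*
	 * While the flag is set, a concurrent checkpoint waits before
	 * deciding which transactions it considers flushed; once it is
	 * cleared, a checkpoint starting afterwards will certainly see
	 * the gxact as a candidate for fsyncing, per the comment quoted
	 * above.
	 */
	MyPgXactStub.delayChkpt = true;
	/* ... write and flush the 2PC WAL record here ... */
	MyPgXactStub.delayChkpt = false;
}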
--
Michael
On Tue, Jan 3, 2017 at 5:38 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
[...]
RecordTransactionCommitPrepared() does the same work for COMMIT PREPARED.
Ok.
I don't have anything more to review in this patch. I will leave the
commitfest entry in "needs review" status for a few days in case anyone
else wants to review it. If no one else reviews it, we can mark it as
"ready for committer".
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Jan 4, 2017 at 1:23 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
[...]
Thanks for the input!
--
Michael
Marking this as ready for committer.
On Wed, Jan 4, 2017 at 12:16 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
[...]
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 12/27/2016 01:31 PM, Andres Freund wrote:
[...]
So, if I understood correctly, the problem scenario is:
1. Create and write to a file.
2. fsync() the file.
3. Crash.
4. After restart, the file is gone.
If that can happen, don't we have the same problem in many other places?
Like, all the SLRUs? They don't fsync the directory either.
Is unlink() guaranteed to be durable, without fsyncing the directory? If
not, then we need to fsync() the directory even if there are no files in
it at the moment, because some might've been removed earlier in the
checkpoint cycle.
- Heikki
On Mon, Jan 30, 2017 at 10:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
So, if I understood correctly, the problem scenario is:
1. Create and write to a file.
2. fsync() the file.
3. Crash.
4. After restart, the file is gone.
Yes, that's a problem for fsync's durability, and we need to ensure it
at checkpoint. I find [1] a good read on the matter; that's easier to
decrypt than [2] or [3] in the POSIX spec.
[1]: http://blog.httrack.com/blog/2013/11/15/everything-you-always-wanted-to-know-about-fsync/
[2]: http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
[3]: http://pubs.opengroup.org/onlinepubs/009695399/functions/rename.html
If that can happen, don't we have the same problem in many other places?
Like, all the SLRUs? They don't fsync the directory either.
Right, pg_commit_ts and pg_clog enter in this category.
Is unlink() guaranteed to be durable, without fsyncing the directory? If
not, then we need to fsync() the directory even if there are no files in it
at the moment, because some might've been removed earlier in the checkpoint
cycle.
Hm... I am not an expert in file systems. At least on ext4 I can see
that unlink() is atomic, but not durable. So if an unlink() is followed
by a power failure, the previously unlinked file could still be there
if the parent directory is not fsync'd.
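To illustrate the unlink side of this with a standalone sketch (the
helper name and path handling are hypothetical and error handling is
minimal; the same pattern later materializes in this thread as
durable_unlink()):

#include <fcntl.h>
#include <libgen.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Remove a file so that the removal itself survives a power failure. */
static int unlink_durably(const char *path)
{
	char *copy;
	int dirfd, rc = -1;

	if (unlink(path) != 0)
		return -1;

	/* dirname() may scribble on its argument, so work on a copy. */
	if ((copy = strdup(path)) == NULL)
		return -1;
	dirfd = open(dirname(copy), O_RDONLY);
	if (dirfd >= 0 && fsync(dirfd) == 0)
		rc = 0;		/* the removal of the directory entry is now on disk */
	if (dirfd >= 0)
		close(dirfd);
	free(copy);
	return rc;
}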
--
Michael
On Fri, Jan 6, 2017 at 9:26 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
[...]
Marking this as ready for committer.
Patch is moved to CF 2017-03.
--
Michael
On Tue, Jan 31, 2017 at 11:07 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Jan 30, 2017 at 10:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
If that can happen, don't we have the same problem in many other places?
Like, all the SLRUs? They don't fsync the directory either.

Right, pg_commit_ts and pg_clog enter in this category.
Implemented as attached.
Is unlink() guaranteed to be durable, without fsyncing the directory?
If not, then we need to fsync() the directory even if there are no
files in it at the moment, because some might've been removed earlier
in the checkpoint cycle.

Hm... I am not an expert in file systems. At least on ext4 I can see
that unlink() is atomic, but not durable. So if an unlink() is followed
by a power failure, the previously unlinked file could still be there
if the parent directory is not fsync'd.
So I have been doing more work on this patch, with the following things done:
- Flush pg_clog, pg_commit_ts and pg_twophase at checkpoint time to
ensure their durability.
- Create a durable_unlink() routine to provide a way to perform a
durable file removal.
I am now counting 111 calls to unlink() in the backend code. Looking at
all of them, most are fine with a plain unlink() even if it is not made
durable, as they work on temporary files (see timeline.c for example),
with some exceptions:
- In pg_stop_backup, the removal of the old backup_label and
tablespace_map should be durable to avoid putting the system in a wrong
state after a power loss. The other calls of unlink() there are
followed by durable_rename(), so they are fine if left as such.
- Removal of old WAL segments should be durable as well. There is
already an effort to rename them durably when a segment is recycled. In
case of a power loss, a file that should have been removed could remain
in pg_xlog.
Looking around, I have also bumped into the following bug report for
SQLite, which is in the same category of things:
http://sqlite.1065341.n5.nabble.com/Potential-bug-in-crash-recovery-code-unlink-and-friends-are-not-synchronous-td68885.html
Scary to see that in this case durability can be a problem at
transaction commit...
--
Michael
Attachments:
unlink-2pc-loss-v3.patch
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 1a43819736..11cba70ece 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -576,6 +576,13 @@ ShutdownCLOG(void)
/* Flush dirty CLOG pages to disk */
TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
SimpleLruFlush(ClogCtl, false);
+
+ /*
+ * fsync pg_clog to ensure that any files flushed previously are durably
+ * on disk.
+ */
+ fsync_fname("pg_clog", true);
+
TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
}
@@ -588,6 +595,13 @@ CheckPointCLOG(void)
/* Flush dirty CLOG pages to disk */
TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
SimpleLruFlush(ClogCtl, true);
+
+ /*
+ * fsync pg_clog to ensure that any files flushed previously are durably
+ * on disk.
+ */
+ fsync_fname("pg_clog", true);
+
TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 20f60bc023..d36014f852 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -746,6 +746,12 @@ ShutdownCommitTs(void)
{
/* Flush dirty CommitTs pages to disk */
SimpleLruFlush(CommitTsCtl, false);
+
+ /*
+ * fsync pg_commit_ts to ensure that any files flushed previously are durably
+ * on disk.
+ */
+ fsync_fname("pg_commit_ts", true);
}
/*
@@ -756,6 +762,12 @@ CheckPointCommitTs(void)
{
/* Flush dirty CommitTs pages to disk */
SimpleLruFlush(CommitTsCtl, true);
+
+ /*
+ * fsync pg_commit_ts to ensure that any files flushed previously are durably
+ * on disk.
+ */
+ fsync_fname("pg_commit_ts", true);
}
/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 50c70b2920..94ff9f3ba5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1642,6 +1642,14 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
}
LWLockRelease(TwoPhaseStateLock);
+ /*
+ * Flush unconditionally the parent directory to make any information
+ * durable on disk. Two-phase files could have been removed and those
+ * removals need to be made persistent as well as any files newly created
+ * previously since the last checkpoint.
+ */
+ fsync_fname(TWOPHASE_DIR, true);
+
TRACE_POSTGRESQL_TWOPHASE_CHECKPOINT_DONE();
if (log_checkpoints && serialized_xacts > 0)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2dcff7f54b..d319b32887 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3445,7 +3445,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
if (!find_free)
{
/* Force installation: get rid of any pre-existing segment file */
- unlink(path);
+ durable_unlink(path, DEBUG1);
}
else
{
@@ -3996,16 +3996,13 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
path)));
return;
}
- rc = unlink(newpath);
+ rc = durable_unlink(newpath, LOG);
#else
- rc = unlink(path);
+ rc = durable_unlink(path, LOG);
#endif
if (rc != 0)
{
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove old transaction log file \"%s\": %m",
- path)));
+ /* Message already logged by durable_unlink() */
return;
}
CheckpointStats.ckpt_segs_removed++;
@@ -10699,17 +10696,13 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
(errcode_for_file_access(),
errmsg("could not read file \"%s\": %m",
BACKUP_LABEL_FILE)));
- if (unlink(BACKUP_LABEL_FILE) != 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- BACKUP_LABEL_FILE)));
+ durable_unlink(BACKUP_LABEL_FILE, ERROR);
/*
* Remove tablespace_map file if present, it is created only if there
* are tablespaces.
*/
- unlink(TABLESPACE_MAP);
+ durable_unlink(TABLESPACE_MAP, DEBUG1);
}
PG_END_ENSURE_ERROR_CLEANUP(pg_stop_backup_callback, (Datum) BoolGetDatum(exclusive));
}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index ce4bd0f3de..4e7ad6c0a5 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -651,6 +651,43 @@ durable_rename(const char *oldfile, const char *newfile, int elevel)
}
/*
+ * durable_unlink -- remove a file in a durable manner
+ *
+ * This routine ensures that, after returning, the effect of removing file
+ * persists in case of a crash. A crash while this routine is running will
+ * leave the system in no mixed state.
+ *
+ * It does so by using fsync on the parent directory of the file after the
+ * actual removal is done.
+ *
+ * Log errors with the severity specified by caller.
+ *
+ * Returns 0 if the operation succeeded, -1 otherwise. Note that errno is not
+ * valid upon return.
+ */
+int
+durable_unlink(const char *fname, int elevel)
+{
+ if (unlink(fname) < 0)
+ {
+ ereport(elevel,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ fname)));
+ return -1;
+ }
+
+ /*
+ * To guarantee that the removal of the file is persistent, fsync
+ * its parent directory.
+ */
+ if (fsync_parent_path(fname, elevel) != 0)
+ return -1;
+
+ return 0;
+}
+
+/*
* durable_link_or_rename -- rename a file in a durable manner.
*
* Similar to durable_rename(), except that this routine tries (but does not
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 1a43a2c844..1c73075071 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -119,6 +119,7 @@ extern int pg_fdatasync(int fd);
extern void pg_flush_data(int fd, off_t offset, off_t amount);
extern void fsync_fname(const char *fname, bool isdir);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
+extern int durable_unlink(const char *fname, int loglevel);
extern int durable_link_or_rename(const char *oldfile, const char *newfile, int loglevel);
extern void SyncDataDirectory(void);
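As a usage note on the new routine sketched in the patch above:
durable_unlink() follows the durable_rename() convention, so the caller
chooses the log severity and, for levels below ERROR, checks the return
code. A hypothetical call site (not taken from the patch itself):

#include "storage/fd.h"		/* durable_unlink(), per the patch above */
#include "utils/elog.h"		/* LOG */

/* Hypothetical caller: remove an on-disk state file durably. */
static void
remove_state_file(const char *path)
{
	if (durable_unlink(path, LOG) != 0)
		return;		/* already logged; errno is no longer meaningful */
	/* the removal now survives an OS crash or a power failure */
}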
On 2/13/17 12:10 AM, Michael Paquier wrote:
[...]
This patch applies cleanly and compiles at cccbdde.
Ashutosh, do you know when you'll have a chance to review?
--
-David
david@pgmasters.net
On Thu, Mar 16, 2017 at 10:17 PM, David Steele <david@pgmasters.net> wrote:
[...]
This patch applies cleanly and compiles at cccbdde.
Ashutosh, do you know when you'll have a chance to review?
The scope of this work has expanded since I last reviewed it and marked
it as RFC. Right now I am busy with partition-wise joins and do not
have sufficient time to take a look at the expanded scope. I can come
back to this after partition-wise joins, but that may stretch to the
end of the commitfest. Sorry.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Ashutosh Bapat
[...]
I marked this as ready for committer. The code looks good, make check
passed, and I ran a read/write pgbench for some period to cause a
checkpoint with WAL file removal, followed by pg_ctl stop. The
checkpoint emitted the following message, and there were no messages
like "could not ..." indicating a file unlink or directory sync
failure.
LOG: checkpoint complete: wrote 1835 buffers (11.2%); 0 transaction log
file(s) added, 1 removed, 0 recycled; write=0.725 s, sync=0.002 s,
total=0.746 s; sync files=9, longest=0.002 s, average=0.000 s;
distance=16381 kB, estimate=16381 kB
Regards
Takayuki Tsunakawa
[...]
Is unlink() guaranteed to be durable, without fsyncing the directory?
If not, then we need to fsync() the directory even if there are no
files in it at the moment, because some might've been removed earlier
in the checkpoint cycle.
What is the protection if pg crashes after the unlink() but before the
fsync()? Right, it's a rather small window for such a scenario, but
isn't it better to have another protection, like WAL-logging the
removal of WAL segments...
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Wed, Mar 22, 2017 at 12:46 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
[...]
What is the protection if pg crashes after the unlink() but before the
fsync()? Right, it's a rather small window for such a scenario, but
isn't it better to have another protection?
This may apply in some cases, but my take on the matter regarding this
patch is that we don't actually need something like that yet, because
there are no cases where we could apply it:
- Removal of past WAL segments is done on a per-node basis.
- Installation of new segments is guessable.
- Removal of backup_label and tablespace_map is based on the timing of
pg_start/stop_backup, once again on a per-node basis.
This patch reduces the window to the minimum we can: once
durable_unlink() returns, we can guarantee that the system is in a
durable state.
Like WAL-logging the removal of WAL segments...
The pace of removal of the WAL segments is different on each node, and
should be different on each node (for example there could be
replication slots with different retention policies on cascading
downstream standbys). So that's not a concept you can apply here.
--
Michael
Hmm, it doesn't work (but applies) on current HEAD:
% uname -a
FreeBSD *** 11.0-RELEASE-p8 FreeBSD 11.0-RELEASE-p8 #0 r315651: Tue Mar 21
02:44:23 MSK 2017 teodor@***:/usr/obj/usr/src/sys/XOR amd64
% pg_config --configure
'--enable-depend' '--enable-cassert' '--enable-debug' '--enable-tap-tests'
'--with-perl' 'CC=clang' 'CFLAGS=-O0'
% rm -rf /spool/pg_data/*
% initdb -n -E UTF8 --locale ru_RU.UTF-8 --lc-messages C --lc-monetary C
--lc-numeric C --lc-time C -D /spool/pg_data
Running in no-clean mode. Mistakes will not be cleaned up.
The files belonging to this database system will be owned by user "teodor".
This user must also own the server process.
The database cluster will be initialized with locales
COLLATE: ru_RU.UTF-8
CTYPE: ru_RU.UTF-8
MESSAGES: C
MONETARY: C
NUMERIC: C
TIME: C
The default text search configuration will be set to "russian".
Data page checksums are disabled.
fixing permissions on existing directory /spool/pg_data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... 2017-03-23 23:07:14.705 MSK [40286] FATAL: could
not open file "pg_clog": No such file or directory
child process exited with exit code 1
initdb: data directory "/spool/pg_data" not removed at user's request
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Fri, Mar 24, 2017 at 5:08 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
Hmm, it doesn't work (but applies) on current HEAD:
[...]
running bootstrap script ... 2017-03-23 23:07:14.705 MSK [40286] FATAL:
could not open file "pg_clog": No such file or directory
child process exited with exit code 1
initdb: data directory "/spool/pg_data" not removed at user's request
And the renaming of pg_clog to pg_xact is also my fault. Attached is
an updated patch.
--
Michael
Attachments:
unlink-2pc-loss-v4.patch
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 2d33510930..7a007a6ba5 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -577,6 +577,13 @@ ShutdownCLOG(void)
/* Flush dirty CLOG pages to disk */
TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
SimpleLruFlush(ClogCtl, false);
+
+ /*
+ * fsync pg_xact to ensure that any files flushed previously are durably
+ * on disk.
+ */
+ fsync_fname("pg_xact", true);
+
TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
}
@@ -589,6 +596,13 @@ CheckPointCLOG(void)
/* Flush dirty CLOG pages to disk */
TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
SimpleLruFlush(ClogCtl, true);
+
+ /*
+ * fsync pg_xact to ensure that any files flushed previously are durably
+ * on disk.
+ */
+ fsync_fname("pg_xact", true);
+
TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 8e1df6e0ea..03ffa20908 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -746,6 +746,12 @@ ShutdownCommitTs(void)
{
/* Flush dirty CommitTs pages to disk */
SimpleLruFlush(CommitTsCtl, false);
+
+ /*
+ * fsync pg_commit_ts to ensure that any files flushed previously are durably
+ * on disk.
+ */
+ fsync_fname("pg_commit_ts", true);
}
/*
@@ -756,6 +762,12 @@ CheckPointCommitTs(void)
{
/* Flush dirty CommitTs pages to disk */
SimpleLruFlush(CommitTsCtl, true);
+
+ /*
+ * fsync pg_commit_ts to ensure that any files flushed previously are durably
+ * on disk.
+ */
+ fsync_fname("pg_commit_ts", true);
}
/*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 4b4999fd7b..83169cccc3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1650,6 +1650,14 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
}
LWLockRelease(TwoPhaseStateLock);
+ /*
+ * Unconditionally flush the parent directory to make any information
+ * durable on disk. Two-phase files could have been removed, and those
+ * removals need to be made persistent, as do any files newly created
+ * since the last checkpoint.
+ */
+ fsync_fname(TWOPHASE_DIR, true);
+
TRACE_POSTGRESQL_TWOPHASE_CHECKPOINT_DONE();
if (log_checkpoints && serialized_xacts > 0)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b99ded5df6..11b1929c94 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3469,7 +3469,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
if (!find_free)
{
/* Force installation: get rid of any pre-existing segment file */
- unlink(path);
+ durable_unlink(path, DEBUG1);
}
else
{
@@ -4020,16 +4020,13 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
path)));
return;
}
- rc = unlink(newpath);
+ rc = durable_unlink(newpath, LOG);
#else
- rc = unlink(path);
+ rc = durable_unlink(path, LOG);
#endif
if (rc != 0)
{
- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not remove old transaction log file \"%s\": %m",
- path)));
+ /* Message already logged by durable_unlink() */
return;
}
CheckpointStats.ckpt_segs_removed++;
@@ -10752,17 +10749,13 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
(errcode_for_file_access(),
errmsg("could not read file \"%s\": %m",
BACKUP_LABEL_FILE)));
- if (unlink(BACKUP_LABEL_FILE) != 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- BACKUP_LABEL_FILE)));
+ durable_unlink(BACKUP_LABEL_FILE, ERROR);
/*
* Remove tablespace_map file if present, it is created only if there
* are tablespaces.
*/
- unlink(TABLESPACE_MAP);
+ durable_unlink(TABLESPACE_MAP, DEBUG1);
}
PG_END_ENSURE_ERROR_CLEANUP(pg_stop_backup_callback, (Datum) BoolGetDatum(exclusive));
}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f0ed2e9b5f..b14979496c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -658,6 +658,43 @@ durable_rename(const char *oldfile, const char *newfile, int elevel)
}
/*
+ * durable_unlink -- remove a file in a durable manner
+ *
+ * This routine ensures that, after returning, the effect of removing the
+ * file persists in case of a crash. A crash while this routine is running
+ * will not leave the system in an inconsistent in-between state.
+ *
+ * It does so by using fsync on the parent directory of the file after the
+ * actual removal is done.
+ *
+ * Log errors with the severity specified by caller.
+ *
+ * Returns 0 if the operation succeeded, -1 otherwise. Note that errno is not
+ * valid upon return.
+ */
+int
+durable_unlink(const char *fname, int elevel)
+{
+ if (unlink(fname) < 0)
+ {
+ ereport(elevel,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ fname)));
+ return -1;
+ }
+
+ /*
+ * To guarantee that the removal of the file is persistent, fsync
+ * its parent directory.
+ */
+ if (fsync_parent_path(fname, elevel) != 0)
+ return -1;
+
+ return 0;
+}
+
+/*
* durable_link_or_rename -- rename a file in a durable manner.
*
* Similar to durable_rename(), except that this routine tries (but does not
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index ac37502928..05680499e4 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -119,6 +119,7 @@ extern int pg_fdatasync(int fd);
extern void pg_flush_data(int fd, off_t offset, off_t amount);
extern void fsync_fname(const char *fname, bool isdir);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
+extern int durable_unlink(const char *fname, int loglevel);
extern int durable_link_or_rename(const char *oldfile, const char *newfile, int loglevel);
extern void SyncDataDirectory(void);
And the renaming of pg_clog to pg_xact is also my fault. Attached is
an updated patch.
Thank you. One more question: what about symlinks? A DBA could move, for example,
pg_xact to another disk and leave a symlink in the data directory. I suppose an
fsync on the symlink would actually do nothing.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Mar 24, 2017 at 11:36 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
And the renaming of pg_clog to pg_xact is also my fault. Attached is
an updated patch.
Thank you. One more question: what about symlinks? A DBA could move, for
example, pg_xact to another disk and leave a symlink in the data directory.
I suppose an fsync on the symlink would actually do nothing.
I did not think of this case, but is it really common? There is not even
an option to configure that at the command level. And surely we cannot
support every fancy scenario that people set up under PGDATA.
Existing callers of fsync_fname do not bother about this case either,
by the way; take the calls related to pg_logical and pg_replslot.
If something should be done in this area, it would surely belong in
fsync_fname directly, to centralize all the calls, and I would treat
that as a separate patch and a separate discussion.
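For what it's worth, since open() and stat() both follow symlinks, an
fsync done through the symlinked path lands on the link target. What
follows is a purely illustrative sketch of how the symlink case could
even be detected, which is not something fsync_fname() does today:

#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of the patch: report whether a path such
 * as "pg_xact" is actually a symlink, e.g. relocated to another disk by
 * a DBA.  lstat() is required here because stat() and open() both follow
 * symlinks and would report on the link target instead.
 */
static bool
path_is_symlink(const char *path)
{
	struct stat st;

	if (lstat(path, &st) < 0)
		return false;		/* treat unreadable paths as non-links */
	return S_ISLNK(st.st_mode);
}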
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thank you. One more question: what about symlinks? A DBA could move, for
example, pg_xact to another disk and leave a symlink in the data directory.
I suppose an fsync on the symlink would actually do nothing.
I did not think of this case, but is it really common? There is not even
an option to configure that at the command level.
I have seen a lot of such cases.
If something should be done in this area, it would surely belong in
fsync_fname directly, to centralize all the calls, and I would treat
that as a separate patch and a separate discussion.
Agree
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thank you, pushed.
Michael Paquier wrote:
On Fri, Mar 24, 2017 at 11:36 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
And the renaming of pg_clog to pg_xact is also my fault. Attached is
an updated patch.
Thank you. One more question: what about symlinks? A DBA could move, for
example, pg_xact to another disk and leave a symlink in the data directory.
I suppose an fsync on the symlink would actually do nothing.
I did not think of this case, but is it really common? There is not even
an option to configure that at the command level. And surely we cannot
support every fancy scenario that people set up under PGDATA.
Existing callers of fsync_fname do not bother about this case either,
by the way; take the calls related to pg_logical and pg_replslot.
If something should be done in this area, it would surely belong in
fsync_fname directly, to centralize all the calls, and I would treat
that as a separate patch and a separate discussion.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Mar 28, 2017 at 1:34 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
Thank you, pushed.
Thanks!
Do you think that this qualifies as a bug fix worth a backpatch? I
would think so, but I would not mind letting the dust settle on it
before considering applying it to the back-branches. Thoughts from
others?
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
Do you think that this qualifies as a bug fix worth a backpatch? I would
think so, but I would not mind letting the dust settle on it before
considering applying it to the back-branches. Thoughts from others?
I think so, too. As the change from rename() to durable_rename() was applied to all releases, maybe we can follow that precedent. In addition, keeping code differences across branches as small as possible would make it easier for committers to apply other bug fixes to all versions.
Regards
Takayuki Tsunakawa
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Mar 28, 2017 at 8:38 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
Do you think that this qualifies as a bug fix worth a backpatch? I would
think so, but I would not mind letting the dust settle on it before
considering applying it to the back-branches. Thoughts from others?
I think so, too. As the change from rename() to durable_rename() was applied to all releases, maybe we can follow that precedent. In addition, keeping code differences across branches as small as possible would make it easier for committers to apply other bug fixes to all versions.
The patch for HEAD applies cleanly on REL9_6_STABLE; further down it
needs some work, but not much. I am fine with producing those
patches should there be any need.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Mar 28, 2017 at 9:37 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Tue, Mar 28, 2017 at 8:38 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
Do you think that this qualifies as a bug fix worth a backpatch? I would
think so, but I would not mind letting the dust settle on it before
considering applying it to the back-branches. Thoughts from others?
I think so, too. As the change from rename() to durable_rename() was applied to all releases, maybe we can follow that precedent. In addition, keeping code differences across branches as small as possible would make it easier for committers to apply other bug fixes to all versions.
The patch for HEAD applies cleanly on REL9_6_STABLE,
Actually that is not correct either, as pg_clog has been renamed to
pg_xact. Let's be careful here.
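For reference, a backpatched version of the clog hunk could not be
applied verbatim: on 9.6 and earlier the directory still carries its
old name, so the same call would need to be spelled differently per
branch (a sketch of the two variants, following the rename mentioned
above):

/* HEAD, where the directory has been renamed to pg_xact: */
fsync_fname("pg_xact", true);

/* 9.6 and earlier, where the directory is still named pg_clog: */
fsync_fname("pg_clog", true);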
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers