silent data loss with ext4 / all current versions

Started by Tomas Vondra about 10 years ago, 86 messages
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com
3 attachment(s)

Hi,

I did some power failure tests (i.e. unexpectedly interrupting power)
a few days ago, and I've discovered a fairly serious case of silent
data loss on ext3/ext4. Initially I thought it was a filesystem bug,
but after further investigation I'm pretty sure it's our
fault.

What happens is that when we recycle WAL segments, we rename them and
then sync them using fdatasync (which is the default on Linux). However,
fdatasync does not force an fsync on the parent directory, so in case of
power failure the rename may get lost. Recovery won't realize that those
segments actually contain changes from the "future" and thus does not
replay them. Hence data loss. Recovery completes as if everything went OK,
so the data loss is entirely silent.

Reproducing this is rather trivial. I've prepared a simple C program
simulating our WAL recycling, which I originally intended to send to the
ext4 mailing list to demonstrate the ext4 bug, before I realized it's
most likely our bug and not theirs.

The example program is called ext4-data-loss.c and is available here
(along with other stuff mentioned in this message):

https://github.com/2ndQuadrant/ext4-data-loss

Compile it, run it (over ssh from another host), interrupt the power, and
after restart you should see that some of the segments were lost (the
rename got reverted).

The git repo also contains a bunch of Python scripts that I initially
used to reproduce this on PostgreSQL - insert.py, update.py and
xlog-watch.py. I'm not going to explain the details here; it's a bit
more complicated, but the cause is exactly the same as with the C
program, just demonstrated in the database. See README for instructions.

So, what's going on? The problem is that while the rename() is atomic,
it's not guaranteed to be durable without an explicit fsync on the
parent directory. And by default we only do fdatasync on the recycled
segments, which may not force fsync on the directory (and ext4 does not
do that, apparently).
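
The rename-then-sync sequence described above can be made durable by also syncing the parent directory. The following is only a minimal sketch in C, not the proposed patch: fsync_path and durable_rename_sketch are hypothetical names, and it assumes both paths live in the same directory (as recycled WAL segments do) on a POSIX system where a directory can be opened read-only and passed to fsync().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: open a file or directory read-only and fsync it. */
static int
fsync_path(const char *path)
{
    int fd, rc, saved_errno;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    rc = fsync(fd);
    saved_errno = errno;        /* keep the fsync error across close() */
    close(fd);
    errno = saved_errno;
    return rc;
}

/*
 * Hypothetical sketch: rename a file, then flush both the file and the
 * directory entry. Assumes oldpath and newpath share a parent directory.
 * Returns 0 on success, -1 on error (errno is set).
 */
static int
durable_rename_sketch(const char *oldpath, const char *newpath)
{
    char dirbuf[4096];

    if (strlen(newpath) >= sizeof(dirbuf))
    {
        errno = ENAMETOOLONG;
        return -1;
    }
    if (rename(oldpath, newpath) != 0)
        return -1;
    /* flush the file contents under the new name */
    if (fsync_path(newpath) != 0)
        return -1;
    /* dirname() may modify its argument, so work on a copy */
    strcpy(dirbuf, newpath);
    /* flush the directory so the rename itself survives a power failure */
    return fsync_path(dirname(dirbuf));
}
```

If the two paths were in different directories, both parent directories would need the same treatment; the sketch leaves that out for brevity.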

This impacts all current kernels (tested on 2.6.32.68, 4.0.5 and
4.4-rc1), and also all supported PostgreSQL versions (tested on 9.1.19,
but I believe all versions since spread checkpoints were introduced are
vulnerable).

FWIW this has nothing to do with storage reliability - you may have good
drives, RAID controller with BBU, reliable SSDs or whatever, and you're
still not safe. This issue is at the filesystem level, not storage.

I've done the same tests on xfs and that seems to be safe - I've been
unable to reproduce the issue, so either the issue is not there or it's
more difficult to hit. I haven't tried other file systems, because
ext4 and xfs cover the vast majority of deployments (at least on Linux),
and I believe an issue on ext4 alone is serious enough.

It's possible to make ext3/ext4 safe with respect to this issue by using
full journaling (data=journal) instead of the default (data=ordered)
mode. However this comes at a significant performance cost and pretty
much no one is using it with PostgreSQL because data=ordered is believed
to be safe.

It's also possible to mitigate this by setting wal_sync_method=fsync,
but I don't think I've ever seen that setting changed at a customer. It
also comes with a significant performance penalty, comparable to setting
data=journal. It does have the advantage that it can be done without
restarting the database (SIGHUP is enough).

So pretty much everyone running on Linux + ext3/ext4 is vulnerable.

It's also worth mentioning that the data is not actually lost - it's
properly fsynced in the WAL segments, it's just the rename that got
lost. So it's possible to survive this without losing data by manually
renaming the segments, but this must happen before starting the cluster,
because otherwise automatic recovery kicks in and discards all that data.

I think this issue might also result in various other issues, not just
data loss. For example, I wouldn't be surprised by data corruption due
to flushing some of the changes in data files to disk (due to contention
for shared buffers and reaching vm.dirty_bytes) and then losing the
matching WAL segment. Also, while I have only seen 1 to 3 segments
getting lost, it might be possible for more segments to get lost,
possibly making recovery impossible. And of course, this might cause
problems with WAL archiving due to archiving the same segment twice
(before and after the crash).

Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty
sure this needs to be backpatched to all backbranches. I've also
attached a patch that adds pg_current_xlog_flush_location() function,
which proved to be quite useful when debugging this issue.

I'd also like to propose adding "last segment" to pg_controldata, next
to the last checkpoint / restartpoint. We don't need to write this on
every commit, once per segment (on the first write) is enough. This
would make investigating the issue much easier, and it'd also make it
possible to terminate the recovery with an error if the last found
segment does not match the expectation (instead of just assuming we've
found all segments, leading to data loss).
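
To illustrate the proposed check, here is a rough sketch in C of how recovery might compare the recorded "last segment" against what it can actually replay. Everything here is an assumption made for illustration: check_last_segment is a hypothetical function, segments are represented by plain segment numbers sorted in ascending order, and nothing like this exists in pg_controldata today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: replay starts at redo_segno and walks consecutive
 * segment numbers found in present[] (sorted ascending), stopping at the
 * first gap. Returns 0 if replay reaches recorded_last_segno, or -1 if a
 * segment is missing before that point (e.g. because its rename was lost).
 */
static int
check_last_segment(uint64_t recorded_last_segno, uint64_t redo_segno,
                   const uint64_t *present, size_t npresent)
{
    uint64_t next = redo_segno;     /* next segment replay expects */

    assert(present != NULL || npresent == 0);
    for (size_t i = 0; i < npresent; i++)
    {
        if (present[i] < next)
            continue;               /* older than the redo point, ignore */
        if (present[i] > next)
            break;                  /* gap: replay would stop here */
        next++;
    }
    /* replay covered every segment up to next - 1 */
    return (next > recorded_last_segno) ? 0 : -1;
}
```

With a check along these lines, a segment whose rename was reverted shows up as a gap before the recorded last segment, and recovery could stop with an error instead of finishing silently.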

Another useful change would be to allow pg_xlogdump to print segments
even if the contents do not match the filename. Currently it's
impossible to even look at the contents in that case, so renaming the
existing segments is mostly guesswork (find segments where pg_xlogdump
fails, try renaming them to the next segments).
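
For context on what "contents do not match the filename" means: a WAL segment file name encodes the timeline and the segment number as three 8-digit hex fields, so the address a segment should start at can be derived from its name and compared with the address stored inside the file. The sketch below only does the name-side arithmetic. It assumes the default 16MB segment size and the post-9.3 naming scheme; parse_wal_filename and segment_start_lsn are hypothetical helper names, not PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WAL_SEG_SIZE    (16 * 1024 * 1024)  /* assumed default: 16MB */
#define SEGS_PER_XLOGID (UINT64_C(0x100000000) / WAL_SEG_SIZE)

/*
 * Hypothetical helper: split a 24-character WAL file name into timeline
 * and segment number. Returns 0 on success, -1 if the name is malformed.
 */
static int
parse_wal_filename(const char *fname, uint32_t *tli, uint64_t *segno)
{
    unsigned int t, log, seg;

    assert(fname != NULL && tli != NULL && segno != NULL);
    if (strlen(fname) != 24 ||
        strspn(fname, "0123456789ABCDEF") != 24)
        return -1;
    if (sscanf(fname, "%08X%08X%08X", &t, &log, &seg) != 3)
        return -1;
    *tli = t;
    *segno = (uint64_t) log * SEGS_PER_XLOGID + seg;
    return 0;
}

/* Byte position in the WAL stream where the given segment starts. */
static uint64_t
segment_start_lsn(uint64_t segno)
{
    return segno * WAL_SEG_SIZE;
}
```

For example, the name 000000010000000200000003 maps to timeline 1 and a start address of 2/03000000. A recycled segment whose rename was lost would carry an old name while its pages record a later address.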

And finally, I've done a quick review of all places that might suffer
the same issue - some are not really interesting as the stuff is
ephemeral anyway (like pgstat for example), but there are ~15 places
that may need this fix:

* src/backend/access/transam/timeline.c (2 matches)
* src/backend/access/transam/xlog.c (9 matches)
* src/backend/access/transam/xlogarchive.c (3 matches)
* src/backend/postmaster/pgarch.c (1 match)

Some of these places might be actually safe because a fsync happens
somewhere immediately after the rename (e.g. in a caller), but I guess
better safe than sorry.

I plan to do more power failure testing soon, with more complex test
scenarios. I suspect there might be other similar issues (e.g. when we
rename a file before a checkpoint and don't fsync the directory - then
the rename won't be replayed and will be lost).

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

xlog-fsync.patch (text/x-diff; name=xlog-fsync.patch)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f17f834..b47c852 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3282,6 +3282,8 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 #endif
 
+	fsync_fname("pg_xlog", true);
+
 	if (use_lock)
 		LWLockRelease(ControlFileLock);
 
ext4-data-loss.c (text/plain; charset=UTF-8; name=ext4-data-loss.c)
xlog-flush.patch (text/x-diff; name=xlog-flush.patch)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f17f834..dec7721 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10626,6 +10626,19 @@ GetXLogWriteRecPtr(void)
 }
 
 /*
+ * Get latest WAL flush pointer
+ */
+XLogRecPtr
+GetXLogFlushRecPtr(void)
+{
+	SpinLockAcquire(&XLogCtl->info_lck);
+	LogwrtResult = XLogCtl->LogwrtResult;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return LogwrtResult.Flush;
+}
+
+/*
  * Returns the redo pointer of the last checkpoint or restartpoint. This is
  * the oldest point in WAL that we still need, if we have to restart recovery.
  */
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 329bb8c..35c581d 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -216,6 +216,27 @@ pg_current_xlog_insert_location(PG_FUNCTION_ARGS)
 }
 
 /*
+ * Report the current WAL flush location (same format as pg_start_backup etc)
+ *
+ * This function is mostly for debugging purposes.
+ */
+Datum
+pg_current_xlog_flush_location(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	current_recptr;
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
+	current_recptr = GetXLogFlushRecPtr();
+
+	PG_RETURN_LSN(current_recptr);
+}
+
+/*
  * Report the last WAL receive location (same format as pg_start_backup etc)
  *
  * This is useful for determining how much of WAL is guaranteed to be received
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 790ca66..985291d 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -235,6 +235,7 @@ extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
 extern XLogRecPtr GetXLogInsertRecPtr(void);
 extern XLogRecPtr GetXLogWriteRecPtr(void);
+extern XLogRecPtr GetXLogFlushRecPtr(void);
 extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
diff --git a/src/include/access/xlog_fn.h b/src/include/access/xlog_fn.h
index 3ebe966..f4575d7 100644
--- a/src/include/access/xlog_fn.h
+++ b/src/include/access/xlog_fn.h
@@ -19,6 +19,7 @@ extern Datum pg_switch_xlog(PG_FUNCTION_ARGS);
 extern Datum pg_create_restore_point(PG_FUNCTION_ARGS);
 extern Datum pg_current_xlog_location(PG_FUNCTION_ARGS);
 extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS);
+extern Datum pg_current_xlog_flush_location(PG_FUNCTION_ARGS);
 extern Datum pg_last_xlog_receive_location(PG_FUNCTION_ARGS);
 extern Datum pg_last_xlog_replay_location(PG_FUNCTION_ARGS);
 extern Datum pg_last_xact_replay_timestamp(PG_FUNCTION_ARGS);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d8640db..ca8fcd4 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3154,6 +3154,8 @@ DATA(insert OID = 2849 ( pg_current_xlog_location	PGNSP PGUID 12 1 0 0 0 f f f f
 DESCR("current xlog write location");
 DATA(insert OID = 2852 ( pg_current_xlog_insert_location	PGNSP PGUID 12 1 0 0 0 f f f f t f v s 0 0 3220 "" _null_ _null_ _null_ _null_ _null_ pg_current_xlog_insert_location _null_ _null_ _null_ ));
 DESCR("current xlog insert location");
+DATA(insert OID = 3330 ( pg_current_xlog_flush_location	PGNSP PGUID 12 1 0 0 0 f f f f t f v s 0 0 3220 "" _null_ _null_ _null_ _null_ _null_ pg_current_xlog_flush_location _null_ _null_ _null_ ));
+DESCR("current xlog flush location");
 DATA(insert OID = 2850 ( pg_xlogfile_name_offset	PGNSP PGUID 12 1 0 0 0 f f f f t f i s 1 0 2249 "3220" "{3220,25,23}" "{i,o,o}" "{wal_location,file_name,file_offset}" _null_ _null_ pg_xlogfile_name_offset _null_ _null_ _null_ ));
 DESCR("xlog filename and byte offset, given an xlog location");
 DATA(insert OID = 2851 ( pg_xlogfile_name			PGNSP PGUID 12 1 0 0 0 f f f f t f i s 1 0 25 "3220" _null_ _null_ _null_ _null_ _null_ pg_xlogfile_name _null_ _null_ _null_ ));
#2 Teodor Sigaev
teodor@sigaev.ru
In reply to: Tomas Vondra (#1)
Re: silent data loss with ext4 / all current versions

What happens is that when we recycle WAL segments, we rename them and then sync
them using fdatasync (which is the default on Linux). However fdatasync does not
force fsync on the parent directory, so in case of power failure the rename may
get lost. The recovery won't realize those segments actually contain changes

Agreed. Some time ago I ran into this myself, although it wasn't with Postgres.

So, what's going on? The problem is that while the rename() is atomic, it's not
guaranteed to be durable without an explicit fsync on the parent directory. And
by default we only do fdatasync on the recycled segments, which may not force
fsync on the directory (and ext4 does not do that, apparently).

This impacts all current kernels (tested on 2.6.32.68, 4.0.5 and 4.4-rc1), and
also all supported PostgreSQL versions (tested on 9.1.19, but I believe all
versions since spread checkpoints were introduced are vulnerable).

FWIW this has nothing to do with storage reliability - you may have good drives,
RAID controller with BBU, reliable SSDs or whatever, and you're still not safe.
This issue is at the filesystem level, not storage.

Agree again.

I plan to do more power failure testing soon, with more complex test scenarios.
I suspect there might be other similar issues (e.g. when we rename a file before
a checkpoint and don't fsync the directory - then the rename won't be replayed
and will be lost).

It would be very useful, but I hope you will not find a new bug :)

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Michael Paquier
michael.paquier@gmail.com
In reply to: Tomas Vondra (#1)
Re: silent data loss with ext4 / all current versions

On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

So, what's going on? The problem is that while the rename() is atomic, it's
not guaranteed to be durable without an explicit fsync on the parent
directory. And by default we only do fdatasync on the recycled segments,
which may not force fsync on the directory (and ext4 does not do that,
apparently).

Yeah, that seems to be the way the POSIX spec clears things.
"If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
force all currently queued I/O operations associated with the file
indicated by file descriptor fildes to the synchronized I/O completion
state. All I/O operations shall be completed as defined for
synchronized I/O file integrity completion."
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
If I understand that right, it is guaranteed that the rename() will be
atomic, meaning that there will be only one file even if there is a
crash, but that we need to fsync() the parent directory as mentioned.

FWIW this has nothing to do with storage reliability - you may have good
drives, RAID controller with BBU, reliable SSDs or whatever, and you're
still not safe. This issue is at the filesystem level, not storage.

The POSIX spec authorizes this behavior, so the FS is not to blame,
clearly. At least that's what I get from it.

I've done the same tests on xfs and that seems to be safe - I've been unable
to reproduce the issue, so either the issue is not there or it's more
difficult to hit it. I haven't tried on other file systems, because ext4 and
xfs cover vast majority of deployments (at least on Linux), and thus issue
on ext4 is serious enough I believe.

So pretty much everyone running on Linux + ext3/ext4 is vulnerable.

It's also worth mentioning that the data is not actually lost - it's
properly fsynced in the WAL segments, it's just the rename that got lost. So
it's possible to survive this without losing data by manually renaming the
segments, but this must happen before starting the cluster because the
automatic recovery comes and discards all the data etc.

Hm. Most users are not going to notice that, particularly where things
are embedded.

I think this issue might also result in various other issues, not just data
loss. For example, I wouldn't be surprised by data corruption due to
flushing some of the changes in data files to disk (due to contention for
shared buffers and reaching vm.dirty_bytes) and then losing the matching WAL
segment. Also, while I have only seen 1 to 3 segments getting lost, it might
be possible that more segments can get lost, possibly making the recovery
impossible. And of course, this might cause problems with WAL archiving due
to archiving the same segment twice (before and after crash).

Possible, the switch to .done is done after renaming the segment in
xlogarchive.c. So this could happen in theory.

Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure
this needs to be backpatched to all backbranches. I've also attached a patch
that adds pg_current_xlog_flush_location() function, which proved to be
quite useful when debugging this issue.

Agreed. We should be sure as well that the calls to fsync_fname get
issued in a critical section with START/END_CRIT_SECTION(). It does
not seem to be the case with your patch.

And finally, I've done a quick review of all places that might suffer the
same issue - some are not really interesting as the stuff is ephemeral
anyway (like pgstat for example), but there are ~15 places that may need
this fix:

* src/backend/access/transam/timeline.c (2 matches)
* src/backend/access/transam/xlog.c (9 matches)
* src/backend/access/transam/xlogarchive.c (3 matches)
* src/backend/postmaster/pgarch.c (1 match)

Some of these places might be actually safe because a fsync happens
somewhere immediately after the rename (e.g. in a caller), but I guess
better safe than sorry.

I had a quick look at those code paths and indeed it would be safer to
be sure that once rename() is called we issue those fsync calls.

I plan to do more power failure testing soon, with more complex test
scenarios. I suspect there might be other similar issues (e.g. when we
rename a file before a checkpoint and don't fsync the directory - then the
rename won't be replayed and will be lost).

That would be great.
--
Michael


#4 Greg Stark
stark@mit.edu
In reply to: Tomas Vondra (#1)
Re: silent data loss with ext4 / all current versions

On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I plan to do more power failure testing soon, with more complex test
scenarios. I suspect there might be other similar issues (e.g. when we
rename a file before a checkpoint and don't fsync the directory - then the
rename won't be replayed and will be lost).

I'm curious how you're doing this testing. The easiest way I can think
of would be to run a database on an LVM volume and take a large number
of LVM snapshots very rapidly and then see if the database can start
up from each snapshot. Bonus points for keeping track of the committed
transactions before each snapshot and ensuring they're still there, I
guess.

That always seemed unsatisfactory because in the past we were mainly
concerned with whether fsync was actually getting propagated to the
physical media. But for testing whether we're fsyncing enough for the
filesystem that would be good enough.

--
greg


#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Greg Stark (#4)
1 attachment(s)
Re: silent data loss with ext4 / all current versions

Hi,

On 11/27/2015 02:28 PM, Greg Stark wrote:

On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I plan to do more power failure testing soon, with more complex
test scenarios. I suspect there might be other similar issues (e.g.
when we rename a file before a checkpoint and don't fsync the
directory - then the rename won't be replayed and will be lost).

I'm curious how you're doing this testing. The easiest way I can
think of would be to run a database on an LVM volume and take a large
number of LVM snapshots very rapidly and then see if the database can
start up from each snapshot. Bonus points for keeping track of the
committed transactions before each snapshot and ensuring they're
still there I guess.

I do have reliable storage (Intel SSD with power-loss protection), and
I've connected the system to a sophisticated power-loss-making device
called "the power switch" (image attached).

In other words, in the last ~7 days the system got rebooted more times
than in the previous ~5 years.

That always seemed unsatisfactory because in the past we were mainly
concerned with whether fsync was actually getting propagated to the
physical media. But for testing whether we're fsyncing enough for
the filesystem that would be good enough.

Yeah. I considered some form of virtualized setup initially, but my
original intent was to verify whether disabling write barriers really is
safe (because I've heard numerous complaints that it's stupid). And as
write barriers are more tightly coupled to the hardware, I opted for the
more brutal approach.

But I agree some form of virtualized setup might be more flexible,
although I'm not sure LVM snapshots are a good approach, as snapshots may
wait for I/O requests to complete and such. I think something like qemu
might work better when combined with "kill -9", and I plan to try
reproducing the data loss issue on such a setup.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

power-failure-device.jpgimage/jpeg; name=power-failure-device.jpgDownload
����JFIF���ExifII*��	LGE���(1�i��Nexus 5HHShotwell 0.15.1������'�P�0220��
���	�
����371��371��371�0100��	��	�	�������;�d���d��R980100��	�http://ns.adobe.com/xap/1.0/<?xpacket begin="���" id="W5M0MpCehiHzreSzNTczkc9d"?> <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 4.4.0-Exiv2"> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> <rdf:Description rdf:about="" xmlns:exif="http://ns.adobe.com/exif/1.0/" xmlns:tiff="http://ns.adobe.com/tiff/1.0/" exif:PixelXDimension="3264" exif:PixelYDimension="2448" tiff:ImageWidth="3264" tiff:ImageHeight="2448" tiff:Orientation="1"/> </rdf:RDF> </x:xmpmeta>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             <?xpacket end="w"?>��C




��C		

����"��	
���}!1AQa"q2���#B��R��$3br�	
%&'()*456789:CDEFGHIJSTUVWXYZcdefghijstuvwxyz���������������������������������������������������������������������������	
���w!1AQaq"2�B����	#3R�br�
$4�%�&'()*56789:CDEFGHIJSTUVWXYZcdefghijstuvwxyz��������������������������������������������������������������������������?��=@���R���FI���'~��	�Ef�����%���'s��>��_|X��|U�������i��
���z������{�*B��6r�}K�Ox�����������u&;Pw��D���_��#�Tw7��:&��g%K���dj����y-��w=������B����sT�k��]Tz!�I!f���M�9s�i�����:��c��*1�^*D�e=x����OL��+�v/-��z�.�8��^H<}i����I2����q�\T����'��D�R����Z����R�UdUr��j�s��Z�����c�����L����ec��t;�$j�w��Lf�����&�Ep��<R�6��k���;����V����,=���R���[��w5<���:��u����z���)&BR������\��������3������+Ic�>��k1t������HN�+[K%�P9������9�&n[����������)��8pG�������s����W]��
�����n�?���o�>�Y����
rI>����k^��-��x�����������������i����w5�k�^F�-��j�c��&�w/'�\w���t��-�q.�o�[y�Y\�����{
��o���j�^j�y�3i�)�FE��[<����Q�����]��U��F�?���������Z��7�^�]Kqu3y��V�F7H��U�{[���%������G`aV�?����EE���dg�����|y�"F�&�>�*���M�����9�������&@�`q���k�x�n^��Y�%x��KDy���u-X��pvvE;T~Ub2`�$w��D�I�7��[��R��OOoj��c�8%'7y=@@$c8�����=�g>2E5�x?���<u MH��v��R���c��+���S�M"�y������ c��t�K�{��6w9]�95���cM��)���l�s��~�)��O�?D�Y�������x��
1�c��R��=f�|C����~>����$��c�>��~|��������7�n[U�p��H��{��f��B��T!����2��9��2��f��>�2tm�E�����+[tX�P�+S8!�Q�
7��L��}k������N���� ���Z�
�'�`�����T�C����$g��)����H%V�qWm�����}i&�bi���q��! �d�'�03�GzM!&g�q�?QVb��P����3��U�~K�����"����Ad��i�����2�8�P�W=)���)7���5�����4:Ln#����
�T���6����PP}��a�F[\�$ylA<���);��k������n8���.������-��{�P��$~jW��qDW>VQ�U���O��#�*'O<�2c?ZM%�&X1���*�� B�~�����������CQ��"t���'�j�{gon�M���VF���6
� ���R���_�f�g��j�����3���Qg�.���K;��5���}�c_x���$���kzt�����z�{�PrAS���/�4_X=����_F��,�G��x���q����������x$+w
Vb@2T��
}����O�~%�.�qq����W����y�x'��a?�sHt��]R5?(I69C^�k��ZL��������Un���`�����yc9�"�=c�z����au�{��~`�+���+�$�����sl���%m�����MkQBO��1��q��)�������g�����D��������Z�O��kE���d�B!`?Z�+F��#M�y��Nd��?���5j��U7��3�O�^��>%��DW��N��y��J�~�I�?��w�������:��i���r��>u���{�-El�K)5;���#����}��c��}����[�Q�$��A�j�
���+���t���l!��P~������}���10���������(ru?%>%k�)����S�\�w!RNp����~U[G���������F�����pf>�9��$�I�8����N*���?�>�:�
iBH�*�Z�� ����?M*���V��@0$�<�z���+��;=����m�q��Y�2P*��9&�$�d��1��$(8RA�y��I-)w>I���u�g��K��b�,�Y�����D��MT?2�����3\S|�������v�t����1�8{�����Hv�����2�=j J�:w�3h����%X��W���A�5Ka:�d�zS3�1�K�'4���8���"sQ�	�����
��<��F�H5�=�X���S�KB�/�{���
O�R�s�V�@�>�-��6�	�v
q�5�c��C�>�����I�C[��l�}*[kI%v���sp�����`A���'���l���$K�$b������C�����A��%�����2k��4�!">�s�B�[�l��2N��`��n��Ia>F���+V����Z,���uL4��z�� �������$6v��w3�!�}s^aq����#�X������V��;[��kd�i�$�[Y�-H�h�O�Q��S���Z���|]�]��t�D�\��%����������(�/i1xC�����j�������G��^��c��9�Z�&����,-e��s��)$���`����lH��Y�E�?S��s���::����B��k��~�6�F
S�f���������+�?������I�.p�{
�1X��v���������
x7K������$0 ,{�{�M[��#��t�J�.m�R���=�r�vH��B���%Y@�'�z��|��{I%��%x��F�4��.<#:_�9$Y�6=�t?�)���|U�yM���lH��l��?Z����K�@p�eHWnO��w���c����e�����O3�_�����Nl��{�O|����Rxl�1i!@����T��7�#
����J��j�5��t���U���]����BHQ(9���P��d���pEX� n�<)�5��v�C/��;bH��Ut]������'�S����IM�gg�*h�1����Ds�~�LS�r8�{{�����5#���B�y��J��	����[|�����"/�O�J��0O�M�2���5���[�f��8 �{��z�1Bp�J�`1���9Y�INps�i���f�l�w���v������
�����
�q�Z�2j�`�6��pEYG'k/+��y,@+���G�,�gilV�����GbC!��C�3���B�r���*���W�I�H�
�q�r*)�,�z���Ntb7!��"�or%C���*7+���p�b�	�����%�#r�G z�Q�����
X��L�r~���	����zc7rT�������;S��''+���� ��c�r8�H�{�������H����s���Q(E�3���P��@�8<q�����g�}*I8��Jq�a6�F���������1X����J��
�����%]�<c>���#1O��'��d�\S��h�;����Z����W���~`q����O���hU�de�Nv�(zt>�[�{w���<7j��h�+;I[
o}6i�g�Q(^���x�>2rG���9��E�U���>zH�``��� �9,�!�>�������2����G�85����K�>���4����$��	�5���R��N��
|;�I�Q��.U�D��>���]RoRf��ks�J�x���C�&r���95�Qo�O��6����$����[~U�EP0��O�����������;�
�UlOz�m�w+�#-�Ki2,q���an"O�"n'��>���-5��t���5��Tbo�#��W��������'8�2��pe��zT��.��17Q���� c#�Uo5f�0+�J�2��;sN�)��(�����	5�uHG�`~m��+m����b�LP�Jg���
w�#���;/�Y��Wx�S��K"A�j�Z���	��{x=�D;NsY�������i���K ���������c�����'"��9��wf�s�< ��U�u�I_��qs�9�����vgQ������\�:����`i=�s��W0��CSF�c49>�$o?���m�Q�B6���TZ�K��������j;K��V;h$��E�I&��
|�o�J�Mk8s���,��X������N������\j��X\�w[yn's���'>���?e=.�}sS{�8&��mO�z������< Z
+J��bP�s�c�y�1����z�r��Z���?�e{�)�� �����������~�5���4�o������{�����������9!y��k8���I�^UZ�+�7dz�������B�K���9���
F�0�"��o�i��Q,��
p~�������=�r�(��7��h ��Q�R�$��1�{Tw*���D`1�h��3�a���%�����:�����T$�����2U$�,;�X_l��a��k�7th@�/��2��V����������@G�W�s���V����bd�@��>�a���\������q����'9$��^���L�e��:
�'?_J\��@�w&���������3�j�H����j��{������4Ik&q��T��4�|�����8��U� u�9����(I�V&PH�kB/��*��FQ�q�L�b�@���b����G�$
i��\�?�c8����g�jAY�0��p���o3��s�T����	PG�^�����h����QH��h$�qU���HU�:����'������C�S�!��S)�����e1�ldw��9�O�x��	$�����>��2ex=3�S��]�*9#1�1���
c�C��LT�VM������2!���S����=��*����q���~�1������&�oqHe�;�Z7(�e�����3qP�1��������{��U�#��5�.B�������/��I	k�}����0zb�$*���xzSk��K*�pz>0*101�#����2����5#rGG^9�*��!�� ����e1�y�
���L��th��� ��4�#���=�]�I�����0��Z6I|�q����c���c���[�_3����2��IJ\��FqE����� s�T������+�{T�fE�����\����r@��w]%
��s��48����;�2@�����>^|y��r9�����/�ke$�&��}+�/�w��x�R��1l�9$�J����������8�S��9'�]/��6��s0��g�9^��An���V��J��8�+�h��c����u6+�~l�5�#M�������w���T�f �A\c�����7;��8�J�T49�9�,K�����Z��� t �=kz�LhN��$�J�.�ynI9_�Z$M�|5�5f��������j����ys0V.���)�)�Edn�#4�R:t���N�*=G�s�������S��8�7b�Wy�!�Y���HGZp�8��Wz��dKm������fS�V���>`jO2�0�6�rD��iN8�����Z�Mq�\}kj����4c�4�&��k
���^;)���9'�sW��J��K9[�m����JM���7���N�t����R���&���7��4���s�l�G�l�������;F&���X����+�lu�nl�V����]����jCj�e�����r�S�;������pW�(U�����A3����7�S�Kd����;���`WIa�mH��+������kmsF�X�MF�BD��~5��Y���S�a����2]�Vz+F�z|��~� +��Oy��b�s���Z��l�+�����U��t��)��.a*��$k�Sswgr���$��V7'�q����]	�	n���{�p��c ����]%��r"��C`}�����T�cJ��/#j�8��j�LQ������KY^ ����VG��9�
��rJ�,�*��1�n�F7�6?w��j|��q�4������S)_PA�)tG<������=���qH��#�@�=���cnN2J��y;Gk�K<�U�f��<��R�2� 8$��:b��N��~x�hD��R�6{zU��/��Z��`�tC�j�-H�9�Uc9#M�����M��1?+ph�E�
�� �A6�pEY�1NO�-�u����%}jx��<w��'�H�\��*u��;��L)�	�����x�&
�����1�jh��1��Z|���1����M;��M�����T�y22�����1�d{���p2
���p~���1���U#%��*eb*V�����E,L��zb��=�cv'�jIF�"'.�3�Z�gte�p{f��)�N���>�T�A�Kgr�F��+���h��O�=j[i�x�8=���'rv5WDl5X��s����B�U
�p9
�"����� RN��d*�t�HCA!�/p;���l{XrG�#��s�E��\P�,e�#��#C�pG�U���v����@���=���#PY���Qky[���)F�F��jB���������-� �e[���C����A��#r����Z$!d���=��'��������D��R�1��oz�X��a�\�{L��*�������O=N�%Q���Y[a��'�R�(%����1$x���{O �\��Nf S�������Rd3#�"�^���O"����q�{
�\��w�������'dx���
�~ �V�hc<�p
�������}�cB�����G��M�JO�=>��xo���Y���c��aD�<������q������-�}���I;v<����X�S��Y|���K�j�e�����HE�@*�Gz�1��Z���=:���K�<nLp��v�x�K[Y\���;��	��S������ ��f�!��D����g4�g�g����.�zT��1���6H*q�zf���D@	�F�v�x�L����y=��U������H���������\���pI��}�Sy|��� d��=�'�(l������
��%�hA,�yPz���� ���L�q��}���,��7)���L��8�O,*��XAUO^���R��G�M���*�oZ����k�{���	z�`)�pq��>�".iO���������5k�lC�wa<�\~�_Z���g�
�����k��!�^H�����?<W%{�~��p�'o#�o����v�.����\����|����\�K����dG����	�-��g\(%@��_?e�>�Wz�~�/�Hr'���_��V���+��#v�B;���p�!��6'�MwR��<����k��l��b9���8�<��)$���>z�nZ���iFCU�a&�85/�T����UW[��reS��==�B=v��O|���S�{�>�>�"�9v2��;Q��_�"����T�xnW�WdT:�]KTe����S��+���������\���,�1�/m�~�}�@)=�B�������qV��	`��q����k&p�x��nz]��$pI����?�U��
'��q�4�7X�1R-���+�a�xw1h��v��|<��������x����x�Y�"��9`�5�O�$�c�cu�ho��;��7s�c�j>���$� t�xS�T����HRk���z���dq�Mk� �o��`����E,!�1�3`!'���$���?
�KO��
��?�X��B+���n1�sR�v)a�_���n��*a���S�>�W����<�� r:��N<
��%���>�����h��G���^fa���V����;�R+�[�g9\v�Z�k�c4C�As����AP�������`�]���n:��Zo�m���:�_D��!�uP� �����g�H���PX����r����2'�%����"���
B(r:q_@��fY@1���w�zo�A�s\�W<�?Z����M\���N���8������I6Rs�������&�r�n;����x(E)VN��T�-�Q���=���M�^2>m�)��������{��L�L��r<��8���V�2:�(B�h=*�"]���>s��u�?/���|5�"&Oq���U��P���185xx+�D�"��#$c���T�V���1G�V�
��c�������g�p0��A�Q^�o���}�r@<���[pxN0�js�A=O��f��!:(���j����e�=3V-���6�2�$��}i��b�'T8r ��_�?*�'�a��L���^G`s���*�&��H�����<[�#��Z��/H$�V�K�k���Z�lB�P�?Oz������X�����������Q�V_bS��J���=?,��|'I�����-����_��y�v�D�{?N+[O��� o$�Y}��?�m������W���e��@���3�����]��('�����d�#���>�����[���x�i.�9#�0N:����i��u�C����8��z*�"%�����eGkg+�!�1�t�I�pZ���9V ��~����)�[�n����)vS�$q�?������7EfX�AL�����!�q��o�U�0O[#����lFc20�z����?Z�������lI���Fl���T|��5K��-@2%�ZD�*������������kk4�)s��+��c�X�k��`���:V��������|J��i[
�}�����O�{��^u�\���{$��>J���#����u��9�F~c\F�����a�G���BK���������t����5-FB08��_�����}���Z��?x�����&�����l{ ���{?�?d�.9�{�'�.����c�����u)�z��"�e/�Y1�5�Nb�t��r��>���������&�+@�����o�����1������hW�]�#��E�[#���izV���b�8�e��3{na����]�y�+�~&V���3���	gNl��������	�}�zv�������M:����!_A&�"c'�� ��+�c�br?U�}.�RY��'p���������%M�dyf����
���d^����q�������m����1�'��"���?�6�r��G<�u��
���
J����#��TRV'��?��A�l�W
9�����tV�JC�
����8������|������fw�����Jj(��"�:�1��U��r��j6�7wP9�����M+���.?�q�;���j2�����x/���
D2<K�_���6;�z�Q��;��zE��a�H�)n
�
z�?�W2J��h#��I ��d��P����I�q�o��7Sb����NH�#����u����N*'>XiV&#�u���Y6j��#2m�����w�j�����9������@����_s�RH��A�H�pinh��(U�8�E���pG=k�{�A/'��8�K���8>��I��������7�����4���Q��W��q���}%�y�����M	���VR����^���=|���>G��"���\�g8�5�x���Ha3�����]����ic|.N�������:�`X��_L��+����]O��g�\�`��D���O��\������/����O���U��y���	�������7S�TcSU-c*����gQ�^���{Y��8�>���y,�{n��x���$xe��|�Z��V/3b*���X�'d�h�x�~V(L�c5r?���Y;��W���s��n��/\��V��]��F���� ����iud}Y#�-|	�hv�� ��{���N"
c��;b�����#���F\
���������{t:�J���_��b"H�V��H��tF`��z����%_/��F=j�xp1����� �X����Q�V�Hd�2x���Z����V"���<�u�L��(���pT�����,����nXp;d�rl�I&y���]M�($c�y���
1�� ��Gc^�i�9,n������W���_�%]<w��Q��R�<�/�E"�#)���
�	n��[����v�� ����� 
���p��gm�r������K�RU�p���F��!��������_"��C��0����$��,/R3�����;���\z�)��\��S�(?)	���p94�|2�o��W*p:���^�
G4|�(������6����1,d�G���>cX�Y\�����6����f��k��.V({�cc3Z�p�]�Py�#��*Y�#X���3�QIKA�������w�VL��j�����V
�;�Q����#<��"��"��������S����]7CR��P+ ���O|�|z�$q�eA �����t��������'�*i�4�*�H����Z�����������&���
��Yv�x����B����Zdl�����a�}�z�G��K��6��E��RF
���
7�=�a>T
���0A�����*���n?�VW��d����{��������H6W��w��QE`�������V��6c�[��������2�����=)��bM�[��][#���U
��f�dU]�7+^	���o�!��N��m��4c�F��C������������B�q��b�`�
I�����xc�2r���<
�'g��GQ�������������F��ps�v��]�8�Y!dT7w�U}�A1S���rqV	!*�.28�9�V��h|�%!W��"�7^;U�xReG�Fp;��W�+��R8�����ZO��v�Kc��}����CF�i�����8���m��B��J���1Y ���������+L�X�@7F���+H�Igx�������z�jt�W������gr��������sB����S����{���p5���gj�Y$�*��u�Xg��[���I�-K��a����� ������W �!Y�g$��9�<�]����$h�SN�]VE ���m��!�n?�{�\�~(j������[���vb����jr�j�f
2��#�����/�ul�+l�|���a���U��m����#�P���K���;u?�xF���{�a�bK������p�����H~{e��s�D��5cu�K�g����,����6�n�;,dQ�W��HT����Z����F���Infc�D�>�WI���<��Iu]�A�������s��*]=y�H�.1�����E�H����y�9 ��t��W�+�/D�..�83����~u�o��e?�
��K�]�~���X>��1��k��������I��A��Y�
�����G��
�o�1��������%�(���7��F'?��c����C���|�<i>�'�N�����}���a�O��A^�inm�H��`���bI��T��;V������"�,y������G��V��u5�/�3U~���a��-()�jc�>.�����P|����������,8�����k�������H�#�$��W#��! ���%A�����]:Q��cc&�-f�����Re���B��/��8
0��� �z������Y�C��w���I�+'��OQ�����t-7���P�E}�r��0?��I�X�F��pn?,gI�,{�e������)^D([�8�U\i<���cq8�?>��E
�d�H�8e;�����$��8��?LR�S?�2����?J.U�~r���,���Q������g��I�C8��d���p��?)C����}N��	�+��R������TU�#��y��s"�I7����/,�a�o)P����=���f��C��$���Zq�<�=O�����V�n��?�87��o';����0W�$�p}��f����g_3q��G��'���G`ln{�9
�bS��J{�>S6Y�y�5��#~��g�#�|�l�&�3����:�fB�W)'�����y`n0r�����z�������W	�zU%�*���v[��d��x�R�Oz�w��v�����z�3��n+�"���,�H��8-��J��&�u��������jk��t�
��W8�u7T�w�0}kRr��V�??������V%��i�;@�)�}95��V�G�a��Y�?i'(�S����HdD$'����T���I��q���d�t�x�:��d��1n��n��%a&�����+nb��B2zg?�cPO��!~P������� �1��]�9�=k��m����%����jt�1FK&P����XDcPW�Y�X��c"�jRE<w"1��U�\v���Z;=Nw��-���p>r=}�J]Qy�wc��t��� #�h��$}@����L|�����w�V��6�TB�Wi$�
sN0��!ye���j��3���d�-�W
p>��v;
��!H�
q�s����df�L��?)��P���X0d��G�W&>DA�6��8G��=YJ�j����7��z_9�R2�3�w����J�$BAs�}y�WJn�>q�`�I�6���������}E3H��L
���)o��G������4����a�6�U;hN�00��X���H<l�L���#�������� ��j��������q���������;pNr���4Z��y����~L��I�2�E9�e,����4�V3�A6��y�B��������'�6�#9���N�Y�O�,����Z�Q������������+l���������������r�!�������	�"�*��Y��M��������$��z��(�%j^���i�OPT��	��
E&�dP	�����?��n�0�B����W@	/6rGL���������*N�n��9��LX@����c���	c�O�P���;�jd��W8������i�e��4-�_* �>�{3�EU���Nl�����r�U\�p��Q�E]�`���R����G:4I271��������\�M�O�zc��l���8���ucWI������^9��P��	�F 0P�OLw�v����W�|g�`�L_'*z���j�������j��L���h�a�Q���E[����"?�q���J]F6��D�Xm�O��L��9%����z`��������X�(>���I��y�w�gi�T�����z�cY��B�s������7-Z�&9C)^0O~j���5��1�2.O+�ER�,�;c��k����mr�s/\�N+V���!����=��}���}B�s~���SisKDq����5gg�h��@U�,��Q��c�O��j�,���������������z����^#���W�#t�u�#�o�a�����0�������km�C���p�0*Ujp�+��gV��|��=�T���<M�ko���[����������y����w/�{u6�8�Ku&��@x_�
��|N� ���'���O�M���5�y\�DE$��Z��V�e�
B�wg����-���Z�}��i��A�j�����'��h�o�d���=�7���~����O���3������q���Qz������5I��Yy�9�g��n=L����Uq�\���W������?�uWs���#!�~5����;"���	�I���(�a����+�K(��G�X�#��V�M)�-������="�W�b�#�>fy���h^	�-a���j�;�+B���W�X��h����DF~�xv�_�!V��93on�>2�3���3���U�`2^,�<�0�����j�B����y�*/GibU��6�T��*��9?�"kH4��\� �xE�v�����p��T�����������p����u����w
�hz(u���'�=��\e<�ES������T�2U~B[~�?�5,�s�!�p����QMd�:�<u (B���R-�s���C6p?
��G�6��&y�����L������/�A��Uw�9M��
�X�`s���!l��e,~��@=����L�q��dRp�Wv}����#F�!p�98?N���"�f#�X��rZL~��LK���A�C�y~f9��?3��S����m����N�>x��_1�������r�K��YU������i�T����!UO�I���,�y#�d}�H��R$���.'9v���E��:I��Jc� ��*�������������UV�~g d���Ko9������8���=���
��~��7;'c�o����|f�F2��l*�9����;���� ePw09�����<���Q�������6���U������nP���{c
}?��"�E,�C���liX����+���{Z��"HfNb�`�9��V���b<m������Y7R�1���;O��E���Z�B��!����G#�y^RAG==*]�b`���#�t��U��"���c�O�������Ds�x���`����n|2'���N*i;�=�P}���\l�%L���� ��^��V����$�Mrb��H��;W�������z	3"<$�$lb?�e_�:�j��1`���~4�������2]\���rHn�5������FOLpk����l�4}���gh�D�3�#��0i���7_j��2I��������X��\�c�����J�CF��_��H��������q)�\���4fmKa�6Hv�������/,���T��QeH��R�$�����!�����0S����U.�1�Y��T$�*LN�J���s�0��i��	����#q$`��wQ����7�������Y��� �;���E��Wm�H�9������zz����x �q�m�#�o�rx����Lj-���
{�T7MH�`G�d�=�7�-Y���"eb��#��Z7j��q��x�P*��:d�sZw�i[<lt���2�F�����TkB�4�d�[+��psT���H��p�u��*�S$|g���[�ug���C0?^����TP�G�O���$lV9�<��Q�B�O�i�rE%��KCV�R�����������d�HA���}G���J����~�����,v`���Uu0��OXM�l6�+'��������Fvn��>�j���#�u�h>������|��\p;����W�
F"T��q��]���w����"����d��/�<s��$�����nA���Q���g	��+g�a���pA��ux�&TU���H��TwY&�� q��R���X�h��G^���*�L���eY���J��SpPA�r�Z)E����w!�V�w,V�!#8���Z
�^��tbS�9 �*3��W*A;��8����wyb�_>j.Ax=QVL�2L��s�G5/TG-��d��8���������<��=
d��2F
$��v�pC1��W�0jU���Q-f�;F�O��g�\��m��.q�=@���p�9�J� �^��H�Lf,���V�%�T! �KB:�>���U��g�\�YpF8����Q�K<3�VzzU�:�+��n�������6;��?�Tb��&s�#y2�����O(�v�Y�}���������'�w'�B����rL�U��$�v���1p��o�w�\f���o��R�{������0c��?7������L�S�����w�=E-n�����Y�����;W�5����VU)e5��.�~�p=3#���^=k��?Ce��F�
��p������%��
�5�m��:I�W�����co+O;����/�)��z}s:��T�I3��5����k���iX����I=1[>�1����-���_����=����tW�ZBU�i"��>((g�'��;�K�o\�:.�sw��|��c�?_C�'�U���c�T�u���7,@� �����Z
��
ZX�d�v���Y��^������o�W,��+#�~����R�����y6�\������{��~xs��������6��n��w9'�]��|�+���d:�"Y������;{{4G�+��t��?��5*��8����+�z,��Z�2n����O�*�dH��\���d����S������q+y�*�	<_n�n���I���[AB
!�rw�&����H�E��'+u�����e�I�3��6����4�(�����G�58}�Tl�������PQV��e'x�q�=i��%����I#�	�G�Q�#�;?�8l�w���d�*��I��:�|���]vWp��B���O��Z���������:�������[��v��H�X�����`:�?�h�������*F����j�*oWo��g�?��3
,������VvT�>���`3�3��V�����*<A������2�c�dn�n�5MHb�2�>~�}}j_�&GER���x?�^?�4�U�\��FB����=*i����VF-�@���j���CxcP�0��H��U�$-��G'�5W��jV���nH{���[�+��1�zz��j�����dU$cj�?������g��>hd��qE�kk��,����YB�^�����k���G�����T��X����MJ��D������z.
=�����,��r� d���F��79�:~5��i���	%w.~���N�E�3�|�1�~T�*�'�F�Qq�*1����)�JB���:8��9�-�����B8�r~���L���9]�����i\��%(
���`����o���-������qTe �#(���}������@�����_B�N��$�$m`�U�~�����f<o������[�@f���j3�����g���r����������G_Lw��f
�00�c���H�����v��=�$��2J��s�q�j���G�"����U�b��_K-�HV�&�~��I4��.8���D����\B��5�|<���-������GV���}����3�{��2E��
s��D��F,��0��O�g|17L��i��;�x��wXf[�}�.N	�����>��T��">`2�+���t�����y�<:������OZ���+ �I�����%��1��v|�WL{ne��kN�{,���7�:`q��*�J����r+'����fyz�k���G��I11i��)a�9�"��,������������F]9��a�/�rqB|���&�@�AXb_���Fh8@Q�>�J�!�VL�]���J��/!bq���*P��������(8��K�E�F���R1�M���<�t�Ir�l�6����U>l���&� e���l2Y����vJ�d@P���{���W���a#
N��)��W�^�����n�5{���	�����a+6���o�lcz7so�89�}(O[
C�'i,nP�{`��MCj���!����Ch���:����H�;J��:~T�v!ljl�_?�P=A����.��!M���B?QQF
�|�t9����������c��R�C6��S���������h��x"�aW,�������q��u�G���w�(��6�A��������h��b�����`��/�@g�@���b���8$s����-��L�=23��T�[��q#jd�=��L�0e�
��z�fG�r2�{:�1_+
I!�c�����.�/��~@������}�Ne�(8'�,��"�����]��6R>dm���3�;����[�}������<U�$��!�1�b���f���\2�GL���2l|n��s����Uq;�[���������a��2psZ��U�mJ��eb��(���9}��m���f�I!g�����|��+Z{9n�G+�R��*i����4Q��c���U��='L��K��nc��b7����"�y�?��gs�G�������+m����x?�,��5/Ei��#U<*������~�_�=>��z���E$:U��(b[�O����u���p�������!��O��&�~$�e����k���"�<� ,�8�kUg��3r�O��}k�:���'8��;��&d��d��w�~x����-�X�����>Z���?�{?�?e=KD�Y�Mb�`�y���S��|�0�Y]�B��W�Zg{�/]M������RB�������?�U����������8-ihw��,x�5����i�,X�Y$;F����Q],z�E�D��R�%2���G�'�K�V���\�\���G��}���^�!�4� p9����>�s��0+���m@;[[9�\��a�@�����A�a�t����y������;�����o�e���"�Ywq��VJ����4NmZ��m��77���f\��A�����T��kd��(�m[ ���S�������x�8��OJ��u�H�&#���B�S���[��id5B;�_����[�p�w���^�����(��`�5�����w����#�5�9t���8�=�:I���s����2����u���YK��|�w2q�3��Ud�2�
q��T����Y7s�L�8�����7R��`09=?w������&I���Q�)��Y�>�5 ��U@6�'���[B��<P�����~���M���8�<�O�UF�x���/����*�v��0;���t	m�ei�<1q�6?���iCz���##�O�A�\�\�B�
��i�5�h����rQ�}���2���l�+L���*���9��B��du��g����T��d#=7��������+|��kE&�J�����<�4 �%���O�R�7� �'ig�@�����K)�L�[a�\��{���Y��P_�����z���W-���V=��3�A�<z�)���F��� �������n���!��q��s��K-�� �zn����Bw'��.�f%c�O_���!������%q��L�%a�������J�V]��[�?���R�O��7���B��������4��i�q=��UFE.�e�����������K�V'����+"�� ����R��t
�QX2z}EU�W��*����q��xK	=����!����S�V�	�����Ve���%�F�`n^O�FV.�_j+|�T`c�����sq�<)���7.��C��FO��U�W�$x�,O<����H��A�oj�v��$��.Nc�����"�I$q��B�W?���� �l�;9d��Uf�(�5VI�=G�R�g �9�����%%wrF0X�;�m�}��p��~b=��6���T��{��Tw*nbdD;y��Y+G}O�0Mo��5C���g�E}<�>��A�'����*p��`c���I���j�����=:b�h���������j�h����Z�}������z)����3�%$_�*�;��g�������/{h|����8V��8k���D��]����j���l���yC��.��1���q�OsVm\[�1�\����T�	uN�CZZ|`����I�f�W7���Zr��PJ����ZVs<f�Q�����'y��>�����p��������X
�I�Egd��ZI)���l�cin�����`��%@����XIs6����#���[����
��8��{�66���Z"�c'�jh��s�8#��P�]Dx����N��yY��UKs����T�4��/���+i����yKQ+g���#�~�\��Y���z����h�|���#?�i �����_��!ry��
�pcH&RG�k*����>d��)���+F��9O��-�x�=�����!���#��Gy���Q�x��<������n�rL�����>���5���T���?��J^lbmT���c�$dv��k�!�b9*��*����V�J�����UkW*Q��98��F�@[�kKh��y�ps��D�,,�q!5�g!�l��H��`�*����
�n���W��%����7jW�20J���	�������=8�.��FH�� L����D�FF���zz����
V�47���qV��b�4��������V�|���.H'��|��.
r'�U���K��S6����=�jK\\"6N�� �H����d�=�"��<~�����;������#�X��,���1�8���C�V^�+F�x@pK�����������Z7����$��X6��izQI�K�S��uX��2>��x��j�?�o^�T��$ ��5���W�i9k�����Nl����T��n�G�k��@6e^��!;�}�S��������m���`p�G���W��p������L��e>�<����MX�VZ���#��C�L�M��=��ZF����u�k_7fl`�Kq^/��F����������g��
|c��Hl����E�������k�a�W5goS��S|�������+�?���s6���<_v-t�Y��o���=��������hVmj�Mbq�aL�
�pr1��f�<1�xf�����#e��`?@2��X�T��Q�-P�SZ��9xK�`�5�I�Cz4����D��	����t�g�?
x0�����s����K����+�[K�:�z|X�k����A�?�W4����*���FP�%�G<��8�PO=k�r�WJ���mR��W}�V�y������y�������'��Min�@�]��f�Am�w�������i��e}�0J`���^=� `]�TF0��SWOw�i�D
-���N�}���,=Ku'���9�Y�3d8!��'��=*<�,e�	�=�	i�b'<v�Us��q����E��<��B`����Sd�0 �ro���s��P�������L��@���������*AtY���S�?B*�r*��*����z~�����u$f�=��J�F@L�d0���P����&������o�)��,[��G�q�3�g����$� �O9$}*��IY�$��q�������2�BG�8=�
-�9&p�����~����@�q�ppq����d3�R��O=��Z�-�r�d�U�Rs�q��M�r��o6��'����p����1�N?�522]Fo;ON�����&�XY�O� 1���O���F��V�tm����YDB=�c;Y����?�z�P-�H�i��'�{��}i�D�������m���������b6�G$����m���q�zv>�[N��9���
����)�-Jj��-�%��P���G���I��heP�R9l�z���hv�$V�T������&R�|�������&e�6�7L`�G�2����!�~��n�5_@s����
t1��N�c$5V�Z���EdM���c��C��y�z}�}��Lg����i��`�E9�*}����v�����aH9��O���<��2���?Z�	m���A+�#��\�S6�U:���kK�'�Nl�9Q��[=
�!��C�������[���#��n�w5J�hD�6".����A�qR���;�X���pNA�������tGy M����Uk�0|��������[_��0��7�
4�;u	%V�r�%N�������l�I��[c��S���3����UZi$WBC��?t�TisH�=e�H�M�$��d'%{S��n1��==sT!�	��O����z�~�fi�G�,|�0S�����Lg	���q��g����Wr�;*2�zO�����g����9���6+��#?_z���$^���VF]�~*��� ����1�����QZ?�c�#������Sd�tRA�|����Jn�Z���
#6Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#3)
Re: silent data loss with ext4 / all current versions

On 11/27/2015 02:18 PM, Michael Paquier wrote:

On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

So, what's going on? The problem is that while the rename() is atomic, it's
not guaranteed to be durable without an explicit fsync on the parent
directory. And by default we only do fdatasync on the recycled segments,
which may not force fsync on the directory (and ext4 does not do that,
apparently).

Yeah, that seems to be the way the POSIX spec clears things.
"If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
force all currently queued I/O operations associated with the file
indicated by file descriptor fildes to the synchronized I/O completion
state. All I/O operations shall be completed as defined for
synchronized I/O file integrity completion."
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
If I understand that right, it is guaranteed that the rename() will be
atomic, meaning that there will be only one file even if there is a
crash, but that we need to fsync() the parent directory as mentioned.

FWIW this has nothing to do with storage reliability - you may have good
drives, RAID controller with BBU, reliable SSDs or whatever, and you're
still not safe. This issue is at the filesystem level, not storage.

The POSIX spec authorizes this behavior, so the FS is not to blame,
clearly. At least that's what I get from it.

The spec seems a bit vague to me (but maybe it's not, I'm not a POSIX
expert), but we should be prepared for the less favorable interpretation
I think.

I think this issue might also result in various other issues, not just data
loss. For example, I wouldn't be surprised by data corruption due to
flushing some of the changes in data files to disk (due to contention for
shared buffers and reaching vm.dirty_bytes) and then losing the matching WAL
segment. Also, while I have only seen 1 to 3 segments getting lost, it might
be possible that more segments can get lost, possibly making the recovery
impossible. And of course, this might cause problems with WAL archiving due
to archiving the same segment twice (before and after crash).

Possible, the switch to .done is done after renaming the segment in
xlogarchive.c. So this could happen in theory.

Yes. That's one of the suspicious places in my notes (haven't posted all
the details, the message was long enough already).

Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure
this needs to be backpatched to all backbranches. I've also attached a patch
that adds pg_current_xlog_flush_location() function, which proved to be
quite useful when debugging this issue.

Agreed. We should be sure as well that the calls to fsync_fname get
issued in a critical section with START/END_CRIT_SECTION(). It does
not seem to be the case with your patch.

Don't know. I've based that on code from replication/logical/ which does
fsync_fname() on all the interesting places, without the critical section.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Michael Paquier
michael.paquier@gmail.com
In reply to: Tomas Vondra (#6)
Re: silent data loss with ext4 / all current versions

On Sat, Nov 28, 2015 at 3:01 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 11/27/2015 02:18 PM, Michael Paquier wrote:

On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

So, what's going on? The problem is that while the rename() is atomic, it's
not guaranteed to be durable without an explicit fsync on the parent
directory. And by default we only do fdatasync on the recycled segments,
which may not force fsync on the directory (and ext4 does not do that,
apparently).

Yeah, that seems to be the way the POSIX spec clears things.
"If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
force all currently queued I/O operations associated with the file
indicated by file descriptor fildes to the synchronized I/O completion
state. All I/O operations shall be completed as defined for
synchronized I/O file integrity completion."
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
If I understand that right, it is guaranteed that the rename() will be
atomic, meaning that there will be only one file even if there is a
crash, but that we need to fsync() the parent directory as mentioned.

FWIW this has nothing to do with storage reliability - you may have good
drives, RAID controller with BBU, reliable SSDs or whatever, and you're
still not safe. This issue is at the filesystem level, not storage.

The POSIX spec authorizes this behavior, so the FS is not to blame,
clearly. At least that's what I get from it.

The spec seems a bit vague to me (but maybe it's not, I'm not a POSIX
expert),

As I am understanding it, FS implementations are free to decide to
make the rename persist on disk or not.

but we should be prepared for the less favorable interpretation I
think.

Yep. I agree. And in case my previous words were not clear, that's the
same line of thought here, we had better cover our backs and study
carefully each code path that could be impacted.

Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure
this needs to be backpatched to all backbranches. I've also attached a patch
that adds pg_current_xlog_flush_location() function, which proved to be
quite useful when debugging this issue.

Agreed. We should be sure as well that the calls to fsync_fname get
issued in a critical section with START/END_CRIT_SECTION(). It does
not seem to be the case with your patch.

Don't know. I've based that on code from replication/logical/ which does
fsync_fname() on all the interesting places, without the critical section.

For slot information in slot.c, there will be a PANIC when fsyncing
pg_replslot at some points. It does not seem that weird to do the same
for example after renaming the backup label file.
--
Michael


#8Craig Ringer
craig@2ndquadrant.com
In reply to: Greg Stark (#4)
Re: silent data loss with ext4 / all current versions

On 27 November 2015 at 21:28, Greg Stark <stark@mit.edu> wrote:

On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I plan to do more power failure testing soon, with more complex test
scenarios. I suspect there might be other similar issues (e.g. when we
rename a file before a checkpoint and don't fsync the directory - then the
rename won't be replayed and will be lost).

I'm curious how you're doing this testing. The easiest way I can think
of would be to run a database on an LVM volume and take a large number
of LVM snapshots very rapidly and then see if the database can start
up from each snapshot. Bonus points for keeping track of the committed
transactions before each snapshot and ensuring they're still there I
guess.

I've had a few tries at implementing a qemu-based crashtester where it hard
kills the qemu instance at a random point then starts it back up.

I always got stuck on the validation part - actually ensuring that the DB
state is how we expect. I think I could probably get that right now, it's
been a while.

The VM can be started back up and killed again over and over quite quickly.

It's not as good as physical plug-pull, but it's a lot more practical.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#9Craig Ringer
craig@2ndquadrant.com
In reply to: Tomas Vondra (#1)
Re: silent data loss with ext4 / all current versions

On 27 November 2015 at 19:17, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

It's also possible to mitigate this by setting wal_sync_method=fsync

Are you sure?

https://lwn.net/Articles/322823/ tends to suggest that fsync() on the file
is insufficient to ensure rename() is persistent, though it's somewhat old.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Craig Ringer (#8)
Re: silent data loss with ext4 / all current versions

Hi,

On 11/29/2015 02:38 PM, Craig Ringer wrote:

On 27 November 2015 at 21:28, Greg Stark <stark@mit.edu
<mailto:stark@mit.edu>> wrote:

On Fri, Nov 27, 2015 at 11:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>>
wrote:

I plan to do more power failure testing soon, with more complex test
scenarios. I suspect there might be other similar issues (e.g. when we
rename a file before a checkpoint and don't fsync the directory - then the
rename won't be replayed and will be lost).

I'm curious how you're doing this testing. The easiest way I can think
of would be to run a database on an LVM volume and take a large number
of LVM snapshots very rapidly and then see if the database can start
up from each snapshot. Bonus points for keeping track of the committed
transactions before each snapshot and ensuring they're still there I
guess.

I've had a few tries at implementing a qemu-based crashtester where it
hard kills the qemu instance at a random point then starts it back up.

I've tried to reproduce the issue by killing a qemu VM, and so far I've
been unsuccessful. On bare HW it was easily reproducible (I'd hit the
issue 9 out of 10 attempts), so either I'm doing something wrong or qemu
somehow interacts with the I/O.

I always got stuck on the validation part - actually ensuring that the
DB state is how we expect. I think I could probably get that right now,
it's been a while.

Well, I guess we can't really check all the details, but I guess the
checksums make checking the general consistency somewhat simpler. And
then you have to design the workload in a way that makes the check
easier - for example remembering the committed values etc.
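
A minimal sketch of such a check, assuming the workload records every value
acknowledged as committed somewhere that survives the crash (e.g. on the host
driving the test), and that the recovered values are read back from the
database after restart. The function name count_lost_commits and the use of
plain long values are illustrative assumptions, not part of the scripts in
the repository.

```c
#include <stddef.h>
#include <stdlib.h>

/* Three-way comparison of two long values, for qsort() and bsearch(). */
static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *) a;
    long y = *(const long *) b;

    return (x > y) - (x < y);
}

/*
 * Hypothetical checker: return how many values that were acknowledged as
 * committed before the crash are missing from the recovered set.
 * Sorts the "recovered" array in place.
 */
static size_t count_lost_commits(const long *committed, size_t ncommitted,
                                 long *recovered, size_t nrecovered)
{
    size_t lost = 0;

    qsort(recovered, nrecovered, sizeof(long), cmp_long);

    for (size_t i = 0; i < ncommitted; i++)
    {
        if (bsearch(&committed[i], recovered, nrecovered,
                    sizeof(long), cmp_long) == NULL)
            lost++;
    }

    return lost;
}
```

A non-zero result after recovery would indicate silent loss of committed
data, which is exactly the symptom described in this thread.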

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Craig Ringer (#9)
Re: silent data loss with ext4 / all current versions

On 11/29/2015 02:41 PM, Craig Ringer wrote:

On 27 November 2015 at 19:17, Tomas Vondra <tomas.vondra@2ndquadrant.com
<mailto:tomas.vondra@2ndquadrant.com>> wrote:

It's also possible to mitigate this by setting wal_sync_method=fsync

Are you sure?

https://lwn.net/Articles/322823/ tends to suggest that fsync() on the
file is insufficient to ensure rename() is persistent, though it's
somewhat old.

Good point. I don't know, and I'm not any smarter after reading the LWN
article. What I meant by "mitigate" is that I've been unable to
reproduce the issue after setting wal_sync_method=fsync, so my
conclusion is that it either fixes the issue or at least significantly
reduces the probability of hitting it.

It's pretty clear that the right fix is the additional fsync on pg_xlog.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#12Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#10)
Re: silent data loss with ext4 / all current versions

On 11/29/2015 03:33 PM, Tomas Vondra wrote:

Hi,

On 11/29/2015 02:38 PM, Craig Ringer wrote:

I've had a few tries at implementing a qemu-based crashtester where it
hard kills the qemu instance at a random point then starts it back up.

I've tried to reproduce the issue by killing a qemu VM, and so far I've
been unsuccessful. On bare HW it was easily reproducible (I'd hit the
issue 9 out of 10 attempts), so either I'm doing something wrong or qemu
somehow interacts with the I/O.

Update: I've managed to reproduce the issue in the qemu setup - I think
it needs slightly different timing due to the VM being slightly slower.
I also tweaked vm.dirty_bytes and vm.dirty_background_bytes to values
used on the bare hardware (I suspect it widens the window).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#13Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Paquier (#3)
Re: silent data loss with ext4 / all current versions

On 11/27/15 8:18 AM, Michael Paquier wrote:

On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

So, what's going on? The problem is that while the rename() is atomic, it's
not guaranteed to be durable without an explicit fsync on the parent
directory. And by default we only do fdatasync on the recycled segments,
which may not force fsync on the directory (and ext4 does not do that,
apparently).

Yeah, that seems to be the way the POSIX spec clears things.
"If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
force all currently queued I/O operations associated with the file
indicated by file descriptor fildes to the synchronized I/O completion
state. All I/O operations shall be completed as defined for
synchronized I/O file integrity completion."
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
If I understand that right, it is guaranteed that the rename() will be
atomic, meaning that there will be only one file even if there is a
crash, but that we need to fsync() the parent directory as mentioned.

I don't see anywhere in the spec that a rename needs an fsync of the
directory to be durable. I can see why that would be needed in
practice, though. File system developers would probably be able to give
a more definite answer.


#14Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Eisentraut (#13)
Re: silent data loss with ext4 / all current versions

On 12/01/2015 10:44 PM, Peter Eisentraut wrote:

On 11/27/15 8:18 AM, Michael Paquier wrote:

On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

So, what's going on? The problem is that while the rename() is atomic, it's
not guaranteed to be durable without an explicit fsync on the parent
directory. And by default we only do fdatasync on the recycled segments,
which may not force fsync on the directory (and ext4 does not do that,
apparently).

Yeah, that seems to be the way the POSIX spec clears things.
"If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
force all currently queued I/O operations associated with the file
indicated by file descriptor fildes to the synchronized I/O completion
state. All I/O operations shall be completed as defined for
synchronized I/O file integrity completion."
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
If I understand that right, it is guaranteed that the rename() will be
atomic, meaning that there will be only one file even if there is a
crash, but that we need to fsync() the parent directory as mentioned.

I don't see anywhere in the spec that a rename needs an fsync of the
directory to be durable. I can see why that would be needed in
practice, though. File system developers would probably be able to
give a more definite answer.

Yeah, POSIX is the lowest common denominator. In this case the spec
seems not to require this durability guarantee (rename without fsync on
directory), which allows a POSIX-compliant filesystem to lose the rename
after a crash.

At least that's my conclusion from reading https://lwn.net/Articles/322823/

However, as I explained in the original post, it's more complicated as
this only seems to be problem with fdatasync. I've been unable to
reproduce the issue with wal_sync_method=fsync.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#7)
1 attachment(s)
Re: silent data loss with ext4 / all current versions

Attached is v2 of the patch, that

(a) adds explicit fsync on the parent directory after all the rename()
calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c

(b) adds START/END_CRIT_SECTION around the new fsync_fname calls
(except for those in timeline.c, as the START/END_CRIT_SECTION is
not available there)

The patch is fairly trivial and I've done some rudimentary testing, but
I'm sure I haven't exercised all the modified paths.
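
As background for (b): as far as I understand it, an ERROR raised while a critical section is open gets promoted to PANIC, so a failed directory fsync inside START/END_CRIT_SECTION takes the server down instead of letting it continue. The toy model below only illustrates that escalation rule. All names (`toy_level`, `crit_section_count`, `TOY_START_CRIT_SECTION`, `TOY_END_CRIT_SECTION`, `effective_level`) are made up for this sketch and are not the PostgreSQL definitions.

```c
#include <assert.h>

/* Toy model only: these names are illustrative, not PostgreSQL source. */
typedef enum { LVL_ERROR, LVL_PANIC } toy_level;

/* Nesting depth of currently open critical sections. */
static int crit_section_count = 0;

#define TOY_START_CRIT_SECTION() (crit_section_count++)
#define TOY_END_CRIT_SECTION() \
    do { \
        assert(crit_section_count > 0); \
        crit_section_count--; \
    } while (0)

/* An ERROR raised while any critical section is open becomes a PANIC. */
static toy_level
effective_level(toy_level requested)
{
    if (requested == LVL_ERROR && crit_section_count > 0)
        return LVL_PANIC;
    return requested;
}
```

The trade-off is that any failure of the wrapped call becomes fatal, which is presumably the intent when the alternative is continuing with a rename that may not survive a crash.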

regards
Tomas

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

xlog-fsync-v2.patchtext/x-diff; name=xlog-fsync-v2.patchDownload
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index c6862a8..998e50b 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -437,6 +437,9 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 						tmppath, path)));
 #endif
 
+	/* Make sure the rename is permanent by fsyncing the directory. */
+	fsync_fname(XLOGDIR, true);
+
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
 	{
@@ -526,6 +529,9 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						tmppath, path)));
 #endif
+
+	/* Make sure the rename is permanent by fsyncing the directory. */
+	fsync_fname(XLOGDIR, true);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f17f834..de24a09 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3282,6 +3282,10 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 #endif
 
+	START_CRIT_SECTION();
+	fsync_fname(XLOGDIR, true);
+	END_CRIT_SECTION();
+
 	if (use_lock)
 		LWLockRelease(ControlFileLock);
 
@@ -3806,6 +3810,11 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 #else
 		rc = unlink(path);
 #endif
+
+		START_CRIT_SECTION();
+		fsync_fname(XLOGDIR, true);
+		END_CRIT_SECTION();
+
 		if (rc != 0)
 		{
 			ereport(LOG,
@@ -5302,6 +5311,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
 
+	START_CRIT_SECTION();
+	fsync_fname(".", true);
+	END_CRIT_SECTION();
+
 	ereport(LOG,
 			(errmsg("archive recovery complete")));
 }
@@ -6155,6 +6168,11 @@ StartupXLOG(void)
 								TABLESPACE_MAP, BACKUP_LABEL_FILE),
 						 errdetail("Could not rename file \"%s\" to \"%s\": %m.",
 								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
+
+			/* fsync the data directory to persist the rename() */
+			START_CRIT_SECTION();
+			fsync_fname(".", true);
+			END_CRIT_SECTION();
 		}
 
 		/*
@@ -6522,6 +6540,14 @@ StartupXLOG(void)
 								TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 		}
 
+		/* fsync the data directory to persist the rename() */
+		if (haveBackupLabel || haveTblspcMap)
+		{
+			START_CRIT_SECTION();
+			fsync_fname(".", true);
+			END_CRIT_SECTION();
+		}
+
 		/* Check that the GUCs used to generate the WAL allow recovery */
 		CheckRequiredParameterValues();
 
@@ -7303,6 +7329,10 @@ StartupXLOG(void)
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
 								origpath, partialpath)));
 				XLogArchiveNotify(partialfname);
+
+				START_CRIT_SECTION();
+				fsync_fname(XLOGDIR, true);
+				END_CRIT_SECTION();
 			}
 		}
 	}
@@ -10906,6 +10936,11 @@ CancelBackup(void)
 						   BACKUP_LABEL_FILE, BACKUP_LABEL_OLD,
 						   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 	}
+
+	/* fsync the data directory to persist the renames */
+	START_CRIT_SECTION();
+	fsync_fname(".", true);
+	END_CRIT_SECTION();
 }
 
 /*
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 7af56a9..1219f4b 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -476,6 +476,10 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						path, xlogfpath)));
 
+	START_CRIT_SECTION();
+	fsync_fname(XLOGDIR, true);
+	END_CRIT_SECTION();
+
 	/*
 	 * Create .done file forcibly to prevent the restored segment from being
 	 * archived again later.
@@ -586,6 +590,10 @@ XLogArchiveForceDone(const char *xlog)
 					 errmsg("could not rename file \"%s\" to \"%s\": %m",
 							archiveReady, archiveDone)));
 
+		START_CRIT_SECTION();
+		fsync_fname(XLOGDIR "/archive_status", true);
+		END_CRIT_SECTION();
+
 		return;
 	}
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 4df669e..be59442 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -733,4 +733,8 @@ pgarch_archiveDone(char *xlog)
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						rlogready, rlogdone)));
+
+	START_CRIT_SECTION();
+	fsync_fname(XLOGDIR "/archive_status", true);
+	END_CRIT_SECTION();
 }
#16Michael Paquier
michael.paquier@gmail.com
In reply to: Tomas Vondra (#15)
Re: silent data loss with ext4 / all current versions

On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Attached is v2 of the patch, that

(a) adds explicit fsync on the parent directory after all the rename()
calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c

(b) adds START/END_CRIT_SECTION around the new fsync_fname calls
(except for those in timeline.c, as the START/END_CRIT_SECTION is
not available there)

The patch is fairly trivial and I've done some rudimentary testing, but I'm
sure I haven't exercised all the modified paths.

I would like to have an in-depth look at that after finishing the
current CF, I am the manager of this one after all... Could you
register it to 2016-01 CF for the time being? I don't mind being
beaten by someone else if this someone has some room to look at this
patch..
--
Michael

#17Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#16)
Re: silent data loss with ext4 / all current versions

On Wed, Dec 2, 2015 at 3:23 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Attached is v2 of the patch, that

(a) adds explicit fsync on the parent directory after all the rename()
calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c

(b) adds START/END_CRIT_SECTION around the new fsync_fname calls
(except for those in timeline.c, as the START/END_CRIT_SECTION is
not available there)

The patch is fairly trivial and I've done some rudimentary testing, but I'm
sure I haven't exercised all the modified paths.

I would like to have an in-depth look at that after finishing the
current CF, I am the manager of this one after all... Could you
register it to 2016-01 CF for the time being? I don't mind being
beaten by someone else if this someone has some room to look at this
patch..

And please feel free to add my name as reviewer.
--
Michael

#18Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#17)
Re: silent data loss with ext4 / all current versions

On Wed, Dec 2, 2015 at 3:24 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 2, 2015 at 3:23 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Attached is v2 of the patch, that

(a) adds explicit fsync on the parent directory after all the rename()
calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c

(b) adds START/END_CRIT_SECTION around the new fsync_fname calls
(except for those in timeline.c, as the START/END_CRIT_SECTION is
not available there)

The patch is fairly trivial and I've done some rudimentary testing, but I'm
sure I haven't exercised all the modified paths.

I would like to have an in-depth look at that after finishing the
current CF, I am the manager of this one after all... Could you
register it to 2016-01 CF for the time being? I don't mind being
beaten by someone else if this someone has some room to look at this
patch..

And please feel free to add my name as reviewer.

Tomas, I am planning to have a look at that, because it seems to be
important. In case it becomes lost on my radar, do you mind if I add
it to the 2016-03 CF?
--
Michael

#19Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#18)
Re: silent data loss with ext4 / all current versions

On 01/19/2016 07:44 AM, Michael Paquier wrote:

On Wed, Dec 2, 2015 at 3:24 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 2, 2015 at 3:23 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Attached is v2 of the patch, that

(a) adds explicit fsync on the parent directory after all the rename()
calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c

(b) adds START/END_CRIT_SECTION around the new fsync_fname calls
(except for those in timeline.c, as the START/END_CRIT_SECTION is
not available there)

The patch is fairly trivial and I've done some rudimentary testing, but I'm
sure I haven't exercised all the modified paths.

I would like to have an in-depth look at that after finishing the
current CF, I am the manager of this one after all... Could you
register it to 2016-01 CF for the time being? I don't mind being
beaten by someone else if this someone has some room to look at this
patch..

And please feel free to add my name as reviewer.

Tomas, I am planning to have a look at that, because it seems to be
important. In case it becomes lost on my radar, do you mind if I add
it to the 2016-03 CF?

Well, what else can I do? I have to admit I'm quite surprised this is
still rotting here, considering it addresses a rather serious data loss
/ corruption issue on a pretty common setup.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#20Michael Paquier
michael.paquier@gmail.com
In reply to: Tomas Vondra (#19)
Re: silent data loss with ext4 / all current versions

On Tue, Jan 19, 2016 at 3:58 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 01/19/2016 07:44 AM, Michael Paquier wrote:

On Wed, Dec 2, 2015 at 3:24 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 2, 2015 at 3:23 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 2, 2015 at 7:05 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Attached is v2 of the patch, that

(a) adds explicit fsync on the parent directory after all the rename()
calls in timeline.c, xlog.c, xlogarchive.c and pgarch.c

(b) adds START/END_CRIT_SECTION around the new fsync_fname calls
(except for those in timeline.c, as the START/END_CRIT_SECTION is
not available there)

The patch is fairly trivial and I've done some rudimentary testing, but
I'm
sure I haven't exercised all the modified paths.

I would like to have an in-depth look at that after finishing the
current CF, I am the manager of this one after all... Could you
register it to 2016-01 CF for the time being? I don't mind being
beaten by someone else if this someone has some room to look at this
patch..

And please feel free to add my name as reviewer.

Tomas, I am planning to have a look at that, because it seems to be
important. In case it becomes lost on my radar, do you mind if I add
it to the 2016-03 CF?

Well, what else can I do? I have to admit I'm quite surprised this is still
rotting here, considering it addresses a rather serious data loss /
corruption issue on a pretty common setup.

Well, I think you did what you could. And we need to be sure now that
it gets in and that this patch gets a serious look. So for now my
guess is that not losing track of it would be a good first move. I
have added it here to attract more attention:
https://commitfest.postgresql.org/9/484/
--
Michael

#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#20)
Re: silent data loss with ext4 / all current versions

On 01/19/2016 08:03 AM, Michael Paquier wrote:

On Tue, Jan 19, 2016 at 3:58 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...

Tomas, I am planning to have a look at that, because it seems to be
important. In case it becomes lost on my radar, do you mind if I add
it to the 2016-03 CF?

Well, what else can I do? I have to admit I'm quite surprised this is still
rotting here, considering it addresses a rather serious data loss /
corruption issue on a pretty common setup.

Well, I think you did what you could. And we need to be sure now that
it gets in and that this patch gets a serious look. So for now my
guess is that not losing track of it would be a good first move. I
have added it here to attract more attention:
https://commitfest.postgresql.org/9/484/

Ah, thanks. I hadn't realized it wasn't added to 2016-01 (I'd swear I
added it to the CF app).

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#22Michael Paquier
michael.paquier@gmail.com
In reply to: Tomas Vondra (#21)
Re: silent data loss with ext4 / all current versions

On Tue, Jan 19, 2016 at 4:20 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 01/19/2016 08:03 AM, Michael Paquier wrote:

On Tue, Jan 19, 2016 at 3:58 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

...

Tomas, I am planning to have a look at that, because it seems to be
important. In case it becomes lost on my radar, do you mind if I add
it to the 2016-03 CF?

Well, what else can I do? I have to admit I'm quite surprised this is still
rotting here, considering it addresses a rather serious data loss /
corruption issue on a pretty common setup.

Well, I think you did what you could. And we need to be sure now that
it gets in and that this patch gets a serious look. So for now my
guess is that not losing track of it would be a good first move. I
have added it here to attract more attention:
https://commitfest.postgresql.org/9/484/

Ah, thanks. I hadn't realized it wasn't added to 2016-01 (I'd swear I added
it to the CF app).

So, I have been playing with a Linux VM with VMware Fusion and on ext4
with data=ordered the renames are getting lost if the root folder is
not fsync'd. By doing a kill -9 on the VM I am able to reproduce that
really easily.

Here are some comments about your patch after a look at the code.

Regarding the fsync_fname() calls added in xlog.c:
1) In InstallXLogFileSegment, rename() will be called only if
HAVE_WORKING_LINK is not defined, which happens only on Windows and
Cygwin. We could add it for consistency, but it should be within the
#else/#endif block. It is not critical as of now.
2) The call in RemoveXlogFile is not necessary, as the rename happens
only on Windows.
3) In exitArchiveRecovery, I think it does not matter much if the
rename is not made durable. If recovery.conf is still present when the
node restarts, the node would go back into recovery, and actually
we had better go back to recovery in this case, no?
4) In StartupXLOG, for the tablespace map: I don't think this one is
needed either. Even if the tablespace map is not removed after a
power loss, the user would get an error saying that the file should be
removed.
5) For the one guarded by haveBackupLabel || haveTblspcMap: if we do the
fsync, we guarantee that there is no need to do the recovery again.
But in case of a power loss, isn't it better to do the recovery again?
6) For the one after XLogArchiveNotify() for the last partial segment
of the old timeline, it doesn't really matter if the change is not
made persistent, as the rename is mainly done because it is useful to
identify that it is a partial segment.
7) In CancelBackup, this one is not needed either, I think. We would
surely want to get back to recovery if those files remain after a
power loss.

For the ones in xlogarchive.c:
1) For KeepFileRestoredFromArchive, it does not matter here, we are
renaming a file with a temporary name to a permanent name. Once the
node restarts, we would see an extra temporary file if the rename was
not effective.
2) XLogArchiveForceDone, the only bad thing that would happen here is
to leave behind a .ready file instead of a .done file. I guess that we
could have it, though, as an optimization to not have to archive this
file again.

For the one in pgarch.c:
1) In pgarch_archiveDone, we could add it as an optimization to
actually let the server know that the segment has already been
archived, preventing a retry.

In timeline.c:
1) writeTimeLineHistory, it does not matter much. In the worst
case we would have just a dummy temporary file, and startup process
would come back here (in exitArchiveRecovery() we may finish with an
unnamed backup file similarly).
2) In writeTimeLineHistoryFile, similarly we don't need to care much,
in WalRcvFetchTimeLineHistoryFiles recovery would take again the same
path

Thoughts?
--
Michael

#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#22)
Re: silent data loss with ext4 / all current versions

Hi,

On 01/22/2016 06:45 AM, Michael Paquier wrote:

So, I have been playing with a Linux VM with VMware Fusion and on
ext4 with data=ordered the renames are getting lost if the root
folder is not fsync'd. By doing a kill -9 on the VM I am able to
reproduce that really easily.

Yep. Same experience here (with qemu-kvm VMs).

Here are some comments about your patch after a look at the code.

Regarding the fsync_fname() calls added in xlog.c:
1) In InstallXLogFileSegment, rename() will be called only if
HAVE_WORKING_LINK is not defined, which happens only on Windows and
Cygwin. We could add it for consistency, but it should be within the
#else/#endif block. It is not critical as of now.
2) The call in RemoveXlogFile is not necessary, as the rename happens
only on Windows.

Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or
could there be some other platforms? And are we sure the file systems on
those platforms are safe without the fsync call?

That is, while the report references ext4, there may be other file
systems with the same problem - ext4 was used mostly as it's the most
widely used Linux file system.

3) In exitArchiveRecovery, I think it does not matter much if the
rename is not made durable. If recovery.conf is still present when
the node restarts, the node would go back into recovery, and
actually we had better go back to recovery in this case, no?

I'm strongly against this "optimization" - I'm more than happy to
exchange the one fsync for not having to manually fix the database after
crash.

I don't really see why switching back to recovery should be desirable in
this case? Imagine you have a warm/hot standby, and that you promote it
to master. The clients connect, start issuing commands and then the
system crashes and loses the recovery.conf rename. The system reboots,
database performs local recovery but then starts as a standby and starts
rejecting writes. That seems really weird to me.

4) In StartupXLOG, for the tablespace map: I don't think this one is
needed either. Even if the tablespace map is not removed after a
power loss, the user would get an error saying that the file should be
removed.

Please no, for the same reasons as in (3).

5) For the one guarded by haveBackupLabel || haveTblspcMap: if we do the
fsync, we guarantee that there is no need to do the recovery again.
But in case of a power loss, isn't it better to do the recovery again?

Why would it be better? Why should we do something twice when we don't
have to? If this were not reliable, then the whole recovery process
would be fundamentally broken and we had better fix it instead of merely
putting a band-aid on it.

6) For the one after XLogArchiveNotify() for the last partial
segment of the old timeline, it doesn't really matter if the change
is not made persistent, as the rename is mainly done because it is
useful to identify that it is a partial segment.

OK, although I still don't quite see why that should be a reason not to
do the fsync. It's not really going to give us any measurable
performance advantage (how often we do those fsyncs), so I'd vote to
keep it and make sure the partial segments are named accordingly. Less
confusion is always better.

7) In CancelBackup, this one is not needed either, I think. We would
surely want to get back to recovery if those files remain after a
power loss.

I may be missing something, but why would we switch to recovery in this
case?

For the ones in xlogarchive.c:
1) For KeepFileRestoredFromArchive, it does not matter here, we are
renaming a file with a temporary name to a permanent name. Once the
node restarts, we would see an extra temporary file if the rename
was not effective.

So we'll lose the segment (won't have it locally under the permanent
name), as we've already restored it and won't do that again. Is that
really a good thing to do?

2) XLogArchiveForceDone, the only bad thing that would happen here is
to leave behind a .ready file instead of a .done file. I guess that we
could have it, though, as an optimization to not have to archive this
file again.

Yes.

For the one in pgarch.c:
1) In pgarch_archiveDone, we could add it as an optimization to
actually let the server know that the segment has already been
archived, preventing a retry.

Not sure what you mean by "could add it as an optimization"?

In timeline.c:
1) writeTimeLineHistory, it does not matter much. In the worst
case we would have just a dummy temporary file, and startup process
would come back here (in exitArchiveRecovery() we may finish with an
unnamed backup file similarly).

OK

2) In writeTimeLineHistoryFile, similarly we don't need to care
much, in WalRcvFetchTimeLineHistoryFiles recovery would take again
the same path

OK

Thoughts?

Thanks for the review and comments. I think the question is whether we
want to do the additional fsync() only when omitting it may ultimately
lead to data loss, or also in cases where omitting it may cause
operational issues (e.g. switching back to recovery needlessly).

I'd vote for the latter, as I think it makes the database easier to
operate (fewer manual interventions) and the performance impact is
effectively zero (as those fsyncs are really rare).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#24Magnus Hagander
magnus@hagander.net
In reply to: Tomas Vondra (#23)
Re: silent data loss with ext4 / all current versions

On Fri, Jan 22, 2016 at 9:26 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Hi,

On 01/22/2016 06:45 AM, Michael Paquier wrote:

So, I have been playing with a Linux VM with VMware Fusion and on
ext4 with data=ordered the renames are getting lost if the root
folder is not fsync'd. By doing a kill -9 on the VM I am able to
reproduce that really easily.

Yep. Same experience here (with qemu-kvm VMs).

Here are some comments about your patch after a look at the code.

Regarding the fsync_fname() calls added in xlog.c:
1) In InstallXLogFileSegment, rename() will be called only if
HAVE_WORKING_LINK is not defined, which happens only on Windows and
Cygwin. We could add it for consistency, but it should be within the
#else/#endif block. It is not critical as of now.
2) The call in RemoveXlogFile is not necessary, as the rename happens
only on Windows.

Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could
there be some other platforms? And are we sure the file systems on those
platforms are safe without the fsync call?

That is, while the report references ext4, there may be other file systems
with the same problem - ext4 was used mostly as it's the most widely used
Linux file system.

3) In exitArchiveRecovery, I think it does not matter much if the
rename is not made durable. If recovery.conf is still present when
the node restarts, the node would go back into recovery, and
actually we had better go back to recovery in this case, no?

I'm strongly against this "optimization" - I'm more than happy to exchange
the one fsync for not having to manually fix the database after crash.

I don't really see why switching back to recovery should be desirable in
this case? Imagine you have a warm/hot standby, and that you promote it to
master. The clients connect, start issuing commands and then the system
crashes and loses the recovery.conf rename. The system reboots, database
performs local recovery but then starts as a standby and starts rejecting
writes. That seems really weird to me.

4) In StartupXLOG, for the tablespace map: I don't think this one is
needed either. Even if the tablespace map is not removed after a
power loss, the user would get an error saying that the file should be
removed.

Please no, for the same reasons as in (3).

5) For the one guarded by haveBackupLabel || haveTblspcMap: if we do the
fsync, we guarantee that there is no need to do the recovery again.
But in case of a power loss, isn't it better to do the recovery again?

Why would it be better? Why should we do something twice when we don't
have to? If this were not reliable, then the whole recovery process
would be fundamentally broken and we had better fix it instead of merely
putting a band-aid on it.

6) For the one after XLogArchiveNotify() for the last partial
segment of the old timeline, it doesn't really matter if the change
is not made persistent, as the rename is mainly done because it is
useful to identify that it is a partial segment.

OK, although I still don't quite see why that should be a reason not to do
the fsync. It's not really going to give us any measurable performance
advantage (how often we do those fsyncs), so I'd vote to keep it and make
sure the partial segments are named accordingly. Less confusion is always
better.

7) In CancelBackup, this one is not needed either, I think. We would
surely want to get back to recovery if those files remain after a
power loss.

I may be missing something, but why would we switch to recovery in this
case?

For the ones in xlogarchive.c:
1) For KeepFileRestoredFromArchive, it does not matter here, we are
renaming a file with a temporary name to a permanent name. Once the
node restarts, we would see an extra temporary file if the rename
was not effective.

So we'll lose the segment (won't have it locally under the permanent
name), as we've already restored it and won't do that again. Is that really
a good thing to do?

2) XLogArchiveForceDone, the only bad thing that would happen here is
to leave behind a .ready file instead of a .done file. I guess that we
could have it, though, as an optimization to not have to archive this
file again.

Yes.

For the one in pgarch.c:
1) In pgarch_archiveDone, we could add it as an optimization to
actually let the server know that the segment has already been
archived, preventing a retry.

Not sure what you mean by "could add it as an optimization"?

In timeline.c:
1) writeTimeLineHistory, it does not matter much. In the worst
case we would have just a dummy temporary file, and startup process
would come back here (in exitArchiveRecovery() we may finish with an
unnamed backup file similarly).

OK

2) In writeTimeLineHistoryFile, similarly we don't need to care
much, in WalRcvFetchTimeLineHistoryFiles recovery would take again
the same path

OK

Thoughts?

Thanks for the review and comments. I think the question is whether we
want to do the additional fsync() only when omitting it may ultimately
lead to data loss, or also in cases where omitting it may cause
operational issues (e.g. switching back to recovery needlessly).

I'd vote for the latter, as I think it makes the database easier to
operate (fewer manual interventions) and the performance impact is
effectively zero (as those fsyncs are really rare).

Yeah, unless it gives a significant performance penalty, I'd agree that the
latter seems like the better option of those. +1 for that way :)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#25Michael Paquier
michael.paquier@gmail.com
In reply to: Tomas Vondra (#23)
Re: silent data loss with ext4 / all current versions

On Fri, Jan 22, 2016 at 5:26 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 01/22/2016 06:45 AM, Michael Paquier wrote:

Here are some comments about your patch after a look at the code.

Regarding the fsync_fname() calls added in xlog.c:
1) In InstallXLogFileSegment, rename() will be called only if
HAVE_WORKING_LINK is not defined, which happens only on Windows and
Cygwin. We could add it for consistency, but it should be within the
#else/#endif block. It is not critical as of now.
2) The call in RemoveXlogFile is not necessary, as the rename happens
only on Windows.

Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could
there be some other platforms? And are we sure the file systems on those
platforms are safe without the fsync call?
That is, while the report references ext4, there may be other file systems
with the same problem - ext4 was used mostly as it's the most widely used
Linux file system.

From pg_config_manual.h:
#if !defined(WIN32) && !defined(__CYGWIN__)
#define HAVE_WORKING_LINK 1
#endif
If we want to be consistent with what POSIX proposes, I am not against
adding it.
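
To make the two code paths concrete, here is a rough sketch of how a file might be installed under its final name depending on HAVE_WORKING_LINK. It is a simplified illustration, not the actual InstallXLogFileSegment code: `install_file` is a hypothetical name, errors are collapsed into a return code, and it is written for POSIX systems only.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

#if !defined(WIN32) && !defined(__CYGWIN__)
#define HAVE_WORKING_LINK 1
#endif

/*
 * Hypothetical helper: make the file at tmppath appear under path.
 * Returns 0 on success, -1 on failure.
 */
static int
install_file(const char *tmppath, const char *path)
{
    assert(tmppath != NULL && path != NULL);

#ifdef HAVE_WORKING_LINK
    /* link() refuses to overwrite: it fails if path already exists. */
    if (link(tmppath, path) < 0)
        return -1;
    unlink(tmppath);
#else
    /* rename() replaces an existing target without complaint. */
    if (rename(tmppath, path) < 0)
        return -1;
#endif
    return 0;
}
```

Note that both branches add and remove directory entries, so it seems plausible that the question of fsyncing the parent directory applies to the link()/unlink() path as well as to rename().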

3) In exitArchiveRecovery, I think it does not matter much if the
rename is not made durable. If recovery.conf is still present when
the node restarts, the node would go back into recovery, and
actually we had better go back to recovery in this case, no?

I'm strongly against this "optimization" - I'm more than happy to exchange
the one fsync for not having to manually fix the database after crash.

I don't really see why switching back to recovery should be desirable in
this case? Imagine you have a warm/hot standby, and that you promote it to
master. The clients connect, start issuing commands and then the system
crashes and loses the recovery.conf rename. The system reboots, database
performs local recovery but then starts as a standby and starts rejecting
writes. That seems really weird to me.

4) In StartupXLOG, for the tablespace map: I don't think this one is
needed either. Even if the tablespace map is not removed after a
power loss, the user would get an error saying that the file should be
removed.

Please no, for the same reasons as in (3).

5) For the one guarded by haveBackupLabel || haveTblspcMap: if we do the
fsync, we guarantee that there is no need to do the recovery again.
But in case of a power loss, isn't it better to do the recovery again?

Why would it be better? Why should we do something twice when we don't have
to? Had this not been reliable, then the whole recovery process would be
fundamentally broken and we had better fix it instead of merely putting a
band-aid on it.

Group shot with 3), 4) and 5). Well, there is no data loss here,
putting me in the direction of considering this addition of an fsync
as an optimization and not a bug.

6) For the one after XLogArchiveNotify() for the last partial
segment of the old timeline, it doesn't really matter to not make the
change persistent as this is mainly done because it is useful to
identify that it is a partial segment.

OK, although I still don't quite see why that should be a reason not to do
the fsync. It's not really going to give us any measurable performance
advantage (how often we do those fsyncs), so I'd vote to keep it and make
sure the partial segments are named accordingly. Less confusion is always
better.

Check.

7) In CancelBackup, this one is not needed either, I think. We would
surely want to get back to recovery if those files remain after a
power loss.

I may be missing something, but why would we switch to recovery in this
case?

For the ones in xlogarchive.c:
1) For KeepFileRestoredFromArchive, it does not matter here, we are
renaming a file with a temporary name to a permanent name. Once the
node restarts, we would see an extra temporary file if the rename
was not effective.

So we'll lose the segment (won't have it locally under the permanent name),
as we've already restored it and won't do that again. Is that really a good
thing to do?

At this point if a segment is restored from archive and there is a
power loss we are going back to recovery. The advantage of having the
fsync is that it ensures the segment is not fetched twice.

2) XLogArchiveForceDone, the only bad thing that would happen here is
to leave behind a .ready file instead of a .done file. I guess that we
could have it though as an optimization to not have to archive again
this file.

Yes.

For the one in pgarch.c:
1) In pgarch_archiveDone, we could add it as an optimization to
actually let the server know that the segment has been already
archived, preventing a retry.

Not sure what you mean by "could add it as an optimization"?

In case of a loss of the rename, server would retry archiving it. So
adding an fsync here ensures that the segment is marked as .done for
good.

Thoughts?

Thanks for the review and comments. I think the question is whether we
want to do the additional fsync() only when it ultimately may lead to data
loss, or even in cases where it may cause operational issues (e.g. switching
back to recovery needlessly).
I'd vote for the latter, as I think it makes the database easier to operate
(less manual interventions) and the performance impact is 0 (as those fsyncs
are really rare).

My first line of thoughts after looking at the patch is that I am not
against adding those fsync calls on HEAD as there is roughly an
advantage to not go back to recovery in most cases and ensure
consistent names, but as they do not imply any data loss I would not
encourage a back-patch. Adding them seems harmless at first sight I
agree, but those are not actual bugs.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26Greg Stark
stark@mit.edu
In reply to: Tomas Vondra (#23)
Re: silent data loss with ext4 / all current versions

On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 01/22/2016 06:45 AM, Michael Paquier wrote:

So, I have been playing with a Linux VM with VMware Fusion and on
ext4 with data=ordered the renames are getting lost if the root
folder is not fsynced. By kill -9'ing the VM I am able to reproduce that
really easily.

Yep. Same experience here (with qemu-kvm VMs).

I still think a better approach for this is to run the database on an
LVM volume and take lots of snapshots. No VM needed, though it doesn't
hurt. LVM volumes are below the level of the filesystem and a snapshot
captures the state of the raw blocks the filesystem has written to the
block layer. The block layer does no caching, though the drive may, but
neither the VM solution nor LVM would capture that.

LVM snapshots would have the advantage that you can keep running the
database and you can take lots of snapshots with relatively little
overhead. Having dozens or hundreds of snapshots would be an unacceptable
performance drain in production but for testing it should be practical
and they take relatively little space -- just the blocks changed since
the snapshot was taken.

--
greg


#27Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#25)
Re: silent data loss with ext4 / all current versions

On 2016-01-22 21:32:29 +0900, Michael Paquier wrote:

Group shot with 3), 4) and 5). Well, there is no data loss here,
putting me in the direction of considering this addition of an fsync
as an optimization and not a bug.

I think this is an extremely weak argument. The reasoning when exactly a
loss of file is acceptable is complicated. In many cases adding an
additional fsync won't add measurable cost, given the frequency of
operations and/or the cost of surrounding operations.

Now, if you can make an argument why something is potentially impacting
performance *and* definitely not required: OK, then we can discuss
that.

Andres


#28Michael Paquier
michael.paquier@gmail.com
In reply to: Greg Stark (#26)
Re: silent data loss with ext4 / all current versions

On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark <stark@mit.edu> wrote:

On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 01/22/2016 06:45 AM, Michael Paquier wrote:

So, I have been playing with a Linux VM with VMware Fusion and on
ext4 with data=ordered the renames are getting lost if the root
folder is not fsynced. By kill -9'ing the VM I am able to reproduce that
really easily.

Yep. Same experience here (with qemu-kvm VMs).

I still think a better approach for this is to run the database on an
LVM volume and take lots of snapshots. No VM needed, though it doesn't
hurt. LVM volumes are below the level of the filesystem and a snapshot
captures the state of the raw blocks the filesystem has written to the
block layer. The block layer does no caching, though the drive may, but
neither the VM solution nor LVM would capture that.

LVM snapshots would have the advantage that you can keep running the
database and you can take lots of snapshots with relatively little
overhead. Having dozens or hundreds of snapshots would be an unacceptable
performance drain in production but for testing it should be practical
and they take relatively little space -- just the blocks changed since
the snapshot was taken.

Another idea: hardcode a PANIC just after rename() with
restart_after_crash = off (this needs IsBootstrapProcess() checks).
Once server crashes, kill-9 the VM. Then restart the VM and the
Postgres instance with a new binary that does not have the PANIC, and
see how things are moving on. There is a window of up to several
seconds after the rename() call, so I guess that this would work.
--
Michael


#29Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#28)
Re: silent data loss with ext4 / all current versions

On 01/23/2016 02:35 AM, Michael Paquier wrote:

On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark <stark@mit.edu> wrote:

On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 01/22/2016 06:45 AM, Michael Paquier wrote:

So, I have been playing with a Linux VM with VMware Fusion and on
ext4 with data=ordered the renames are getting lost if the root
folder is not fsynced. By kill -9'ing the VM I am able to reproduce that
really easily.

Yep. Same experience here (with qemu-kvm VMs).

I still think a better approach for this is to run the database on an
LVM volume and take lots of snapshots. No VM needed, though it doesn't
hurt. LVM volumes are below the level of the filesystem and a snapshot
captures the state of the raw blocks the filesystem has written to the
block layer. The block layer does no caching, though the drive may, but
neither the VM solution nor LVM would capture that.

LVM snapshots would have the advantage that you can keep running the
database and you can take lots of snapshots with relatively little
overhead. Having dozens or hundreds of snapshots would be an unacceptable
performance drain in production but for testing it should be practical
and they take relatively little space -- just the blocks changed since
the snapshot was taken.

Another idea: hardcode a PANIC just after rename() with
restart_after_crash = off (this needs IsBootstrapProcess() checks).
Once server crashes, kill-9 the VM. Then restart the VM and the
Postgres instance with a new binary that does not have the PANIC, and
see how things are moving on. There is a window of up to several
seconds after the rename() call, so I guess that this would work.

I don't see how that would improve anything, as the PANIC has no impact
on the I/O requests already issued to the system. What you need is some
sort of coordination between the database and the script that kills the
VM (or takes a LVM snapshot).

That can be done by simply emitting a particular log message, and the
"kill script" may simply watch the file (for example over SSH). This has
the benefit that you can also watch for additional conditions that are
difficult to check from that particular part of the code (and only kill
the VM when all of them trigger - for example only on the third
checkpoint since start, and such).
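
To make the coordination idea a bit more concrete, here is a minimal sketch of the watching side. It assumes the server log is streamed to the watcher (for example through a pipe fed by tail -f over SSH) and that the patched server emits some agreed-upon marker string; both the marker and the function name wait_for_marker() are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read lines from 'log' until 'marker' has been seen 'nth' times.
 * Returns 1 as soon as the nth occurrence is read (the caller would then
 * kill the VM or take the snapshot), or 0 if the stream ends first.
 *
 * Simplification: lines longer than the buffer are processed in pieces,
 * so a marker that straddles two pieces would be missed.
 */
int
wait_for_marker(FILE *log, const char *marker, int nth)
{
	char		line[1024];
	int			seen = 0;

	while (fgets(line, sizeof(line), log) != NULL)
	{
		if (strstr(line, marker) != NULL && ++seen >= nth)
			return 1;
	}
	return 0;
}
```

Calling wait_for_marker(stdin, "KILL_MARKER", 3) would, under those assumptions, return right after the third marked event, which is the "only on the third checkpoint" kind of condition mentioned above.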

The reason why I was not particularly thrilled about the LVM snapshot
idea is that to identify this particular data loss issue, you need to be
able to reason about the expected state of the database (what
transactions are committed, how many segments are there). And my
understanding was that Greg's idea was merely "try to start the DB on a
snapshot and see if it starts / is not corrupted," which would not work
with this particular issue, as the database seemed just fine - the data
loss is silent. Adding the "last XLOG segment" into pg_controldata would
make it easier to detect without having to track details about which
transactions got committed.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#30Michael Paquier
michael.paquier@gmail.com
In reply to: Tomas Vondra (#29)
Re: silent data loss with ext4 / all current versions

On Sat, Jan 23, 2016 at 11:39 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 01/23/2016 02:35 AM, Michael Paquier wrote:

On Fri, Jan 22, 2016 at 9:41 PM, Greg Stark <stark@mit.edu> wrote:

On Fri, Jan 22, 2016 at 8:26 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
LVM snapshots would have the advantage that you can keep running the
database and you can take lots of snapshots with relatively little
overhead. Having dozens or hundreds of snapshots would be an unacceptable
performance drain in production but for testing it should be practical
and they take relatively little space -- just the blocks changed since
the snapshot was taken.

Another idea: hardcode a PANIC just after rename() with
restart_after_crash = off (this needs IsBootstrapProcess() checks).
Once server crashes, kill-9 the VM. Then restart the VM and the
Postgres instance with a new binary that does not have the PANIC, and
see how things are moving on. There is a window of up to several
seconds after the rename() call, so I guess that this would work.

I don't see how that would improve anything, as the PANIC has no impact on
the I/O requests already issued to the system. What you need is some sort of
coordination between the database and the script that kills the VM (or takes
a LVM snapshot).

Well, to emulate the noise that non-renamed files have on the system
we could simply emulate the loss of rename() by just commenting it out
and then forcibly crash the instance or just PANIC the instance just
before rename(). This would emulate what we are looking for, no? What
we want to check is how the system reacts should an unwanted file be
in place.
For example, take the rename() call in InstallXLogFileSegment(),
crashing with a non-effective rename() will cause the presence of an
annoying xlogtemp file. Making the rename persistent would make the
server complain about an invalid magic number in a segment that has
just been created.
--
Michael


#31Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#25)
1 attachment(s)
Re: silent data loss with ext4 / all current versions

On Fri, Jan 22, 2016 at 9:32 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Jan 22, 2016 at 5:26 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 01/22/2016 06:45 AM, Michael Paquier wrote:

Here are some comments about your patch after a look at the code.

Regarding the additions in fsync_fname() in xlog.c:
1) In InstallXLogFileSegment, rename() will be called only if
HAVE_WORKING_LINK is not used, which happens only on Windows and
cygwin. We could add it for consistency, but it should be within the
#else/#endif block. It is not critical as of now.
2) The call in RemoveXlogFile is not necessary, the rename happening
only on Windows.

Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could
there be some other platforms? And are we sure the file systems on those
platforms are safe without the fsync call?
That is, while the report references ext4, there may be other file systems
with the same problem - ext4 was used mostly as it's the most widely used
Linux file system.

From pg_config_manual.h:
#if !defined(WIN32) && !defined(__CYGWIN__)
#define HAVE_WORKING_LINK 1
#endif
If we want to be consistent with what Posix proposes, I am not against
adding it.

I did some tests with NTFS using cygwin, and the rename() calls remain
even after powering off the VM. But I agree that adding an fsync() in
both cases would be fine.

Thoughts?

Thanks for the review and comments. I think the question is whether we
want to do the additional fsync() only when it ultimately may lead to data
loss, or even in cases where it may cause operational issues (e.g. switching
back to recovery needlessly).
I'd vote for the latter, as I think it makes the database easier to operate
(less manual interventions) and the performance impact is 0 (as those fsyncs
are really rare).

My first line of thoughts after looking at the patch is that I am not
against adding those fsync calls on HEAD as there is roughly an
advantage to not go back to recovery in most cases and ensure
consistent names, but as they do not imply any data loss I would not
encourage a back-patch. Adding them seems harmless at first sight I
agree, but those are not actual bugs.

OK. It is true that PGDATA would be fsync'd in 4 code paths with your
patch, none of which is taken very often:
- Renaming tablespace map file and backup label file (three times)
- Renaming to recovery.done
So, what do you think about the patch attached? Moving the calls into
the critical sections is not really necessary except when installing a
new segment.
--
Michael

Attachments:

xlog-fsync-v3.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index f6da673..4173a50 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -435,6 +435,12 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						tmppath, path)));
+
+	/*
+	 * Make sure the rename is permanent by fsyncing the parent
+	 * directory.
+	 */
+	fsync_fname(XLOGDIR, true);
 #endif
 
 	/* The history file can be archived immediately. */
@@ -525,6 +531,9 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						tmppath, path)));
+
+	/* Make sure the rename is permanent by fsyncing the directory. */
+	fsync_fname(XLOGDIR, true);
 #endif
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a2846c4..b124f90 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3278,6 +3278,14 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 						tmppath, path)));
 		return false;
 	}
+
+	/*
+	 * Make sure the rename is permanent by fsyncing the parent
+	 * directory.
+	 */
+	START_CRIT_SECTION();
+	fsync_fname(XLOGDIR, true);
+	END_CRIT_SECTION();
 #endif
 
 	if (use_lock)
@@ -3800,10 +3808,18 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 					  path)));
 			return;
 		}
+
+		/*
+		 * Make sure the rename is permanent by fsyncing the parent
+		 * directory.
+		 */
+		fsync_fname(XLOGDIR, true);
+
 		rc = unlink(newpath);
 #else
 		rc = unlink(path);
 #endif
+
 		if (rc != 0)
 		{
 			ereport(LOG,
@@ -5297,6 +5313,9 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
 
+	/* Make sure the rename is permanent by fsyncing the data directory. */
+	fsync_fname(".", true);
+
 	ereport(LOG,
 			(errmsg("archive recovery complete")));
 }
@@ -6150,6 +6169,12 @@ StartupXLOG(void)
 								TABLESPACE_MAP, BACKUP_LABEL_FILE),
 						 errdetail("Could not rename file \"%s\" to \"%s\": %m.",
 								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
+
+			/*
+			 * Make sure the rename is permanent by fsyncing the data
+			 * directory.
+			 */
+			fsync_fname(".", true);
 		}
 
 		/*
@@ -6525,6 +6550,13 @@ StartupXLOG(void)
 								TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 		}
 
+		/*
+		 * Make sure the rename is permanent by fsyncing the parent
+		 * directory.
+		 */
+		if (haveBackupLabel || haveTblspcMap)
+			fsync_fname(".", true);
+
 		/* Check that the GUCs used to generate the WAL allow recovery */
 		CheckRequiredParameterValues();
 
@@ -7305,6 +7337,12 @@ StartupXLOG(void)
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
 								origpath, partialpath)));
 				XLogArchiveNotify(partialfname);
+
+				/*
+				 * Make sure the rename is permanent by fsyncing the parent
+				 * directory.
+				 */
+				fsync_fname(XLOGDIR, true);
 			}
 		}
 	}
@@ -10905,6 +10943,9 @@ CancelBackup(void)
 						   BACKUP_LABEL_FILE, BACKUP_LABEL_OLD,
 						   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 	}
+
+	/* fsync the data directory to persist the renames */
+	fsync_fname(".", true);
 }
 
 /*
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 277c14a..8dda80b 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -477,6 +477,12 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 						path, xlogfpath)));
 
 	/*
+	 * Make sure the renames are permanent by fsyncing the parent
+	 * directory.
+	 */
+	fsync_fname(XLOGDIR, true);
+
+	/*
 	 * Create .done file forcibly to prevent the restored segment from being
 	 * archived again later.
 	 */
@@ -586,6 +592,11 @@ XLogArchiveForceDone(const char *xlog)
 					 errmsg("could not rename file \"%s\" to \"%s\": %m",
 							archiveReady, archiveDone)));
 
+		/*
+		 * Make sure the rename is permanent by fsyncing the parent
+		 * directory.
+		 */
+		fsync_fname(XLOGDIR "/archive_status", true);
 		return;
 	}
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 397f802..7165c74 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -733,4 +733,10 @@ pgarch_archiveDone(char *xlog)
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						rlogready, rlogdone)));
+
+	/*
+	 * Make sure the rename is permanent by fsyncing the parent
+	 * directory.
+	 */
+	fsync_fname(XLOGDIR "/archive_status", true);
 }
#32Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#31)
Re: silent data loss with ext4 / all current versions

On 01/25/2016 08:30 AM, Michael Paquier wrote:

On Fri, Jan 22, 2016 at 9:32 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

...

My first line of thoughts after looking at the patch is that I am
not against adding those fsync calls on HEAD as there is roughly
an advantage to not go back to recovery in most cases and ensure
consistent names, but as they do not imply any data loss I would
not encourage a back-patch. Adding them seems harmless at first
sight I agree, but those are not actual bugs.

OK. It is true that PGDATA would be fsync'd in 4 code paths with your
patch, none of which is taken very often:
- Renaming tablespace map file and backup label file (three times)
- Renaming to recovery.done
So, what do you think about the patch attached? Moving the calls into
the critical sections is not really necessary except when installing a
new segment.

Seems OK to me. Thanks for the time and improvements!

Tomas

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#33Michael Paquier
michael.paquier@gmail.com
In reply to: Tomas Vondra (#32)
Re: silent data loss with ext4 / all current versions

On Mon, Jan 25, 2016 at 6:50 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Seems OK to me. Thanks for the time and improvements!

Thanks. Perhaps a committer could have a look then? I have switched
the patch as such in the CF app. Seeing the accumulated feedback
upthread that's something that should be backpatched.
--
Michael


#34Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#33)
Re: silent data loss with ext4 / all current versions

Michael Paquier wrote:

On Mon, Jan 25, 2016 at 6:50 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Seems OK to me. Thanks for the time and improvements!

Thanks. Perhaps a committer could have a look then? I have switched
the patch as such in the CF app. Seeing the accumulated feedback
upthread that's something that should be backpatched.

Yeah. On 9.4 there are already some conflicts, and I'm sure there will
be more in almost each branch. Does anyone want to volunteer for
producing per-branch versions?

The next minor release is to be tagged next week and it'd be good to put
this fix there.

Thanks,

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#35Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#31)
Re: silent data loss with ext4 / all current versions

On 2016-01-25 16:30:47 +0900, Michael Paquier wrote:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a2846c4..b124f90 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3278,6 +3278,14 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
tmppath, path)));
return false;
}
+
+	/*
+	 * Make sure the rename is permanent by fsyncing the parent
+	 * directory.
+	 */
+	START_CRIT_SECTION();
+	fsync_fname(XLOGDIR, true);
+	END_CRIT_SECTION();
#endif

Hm. I'm seriously doubtful that using critical sections for this is a
good idea. What's the scenario you're protecting against by declaring
this one? We intentionally don't error out in the isdir cases in
fsync_fname() anyway.

Afaik we need to fsync tmppath before and after the rename, and only
then the directory, to actually be safe.

@@ -5297,6 +5313,9 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
errmsg("could not rename file \"%s\" to \"%s\": %m",
RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));

+	/* Make sure the rename is permanent by fsyncing the data directory. */
+	fsync_fname(".", true);
+

Shouldn't RECOVERY_COMMAND_DONE be fsynced first here?

ereport(LOG,
(errmsg("archive recovery complete")));
}
@@ -6150,6 +6169,12 @@ StartupXLOG(void)
TABLESPACE_MAP, BACKUP_LABEL_FILE),
errdetail("Could not rename file \"%s\" to \"%s\": %m.",
TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
+
+			/*
+			 * Make sure the rename is permanent by fsyncing the data
+			 * directory.
+			 */
+			fsync_fname(".", true);
}

Is it just me, or are the repeated four line comments a bit grating?

/*
@@ -6525,6 +6550,13 @@ StartupXLOG(void)
TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
}

+		/*
+		 * Make sure the rename is permanent by fsyncing the parent
+		 * directory.
+		 */
+		if (haveBackupLabel || haveTblspcMap)
+			fsync_fname(".", true);
+

Isn't that redundant with the haveTblspcMap case above?

/* Check that the GUCs used to generate the WAL allow recovery */
CheckRequiredParameterValues();

@@ -7305,6 +7337,12 @@ StartupXLOG(void)
errmsg("could not rename file \"%s\" to \"%s\": %m",
origpath, partialpath)));
XLogArchiveNotify(partialfname);
+
+				/*
+				 * Make sure the rename is permanent by fsyncing the parent
+				 * directory.
+				 */
+				fsync_fname(XLOGDIR, true);

.partial should be fsynced first.

}
}
}
@@ -10905,6 +10943,9 @@ CancelBackup(void)
BACKUP_LABEL_FILE, BACKUP_LABEL_OLD,
TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
}
+
+	/* fsync the data directory to persist the renames */
+	fsync_fname(".", true);
}

Same.

/*
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 277c14a..8dda80b 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -477,6 +477,12 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
path, xlogfpath)));
/*
+	 * Make sure the renames are permanent by fsyncing the parent
+	 * directory.
+	 */
+	fsync_fname(XLOGDIR, true);

Afaics the file under the temporary filename has not been fsynced up to
here.

+ /*
* Create .done file forcibly to prevent the restored segment from being
* archived again later.
*/
@@ -586,6 +592,11 @@ XLogArchiveForceDone(const char *xlog)
errmsg("could not rename file \"%s\" to \"%s\": %m",
archiveReady, archiveDone)));

+		/*
+		 * Make sure the rename is permanent by fsyncing the parent
+		 * directory.
+		 */
+		fsync_fname(XLOGDIR "/archive_status", true);
return;
}

Afaics the AllocateFile() call below is not protected at all, no?

How about introducing a 'safe_rename()' that does something roughly akin
to:
fsync(oldfname);
fsync(fname) || true;
rename(oldfname, fname);
fsync(fname);
fsync(basename(fname));

I'd rather have that kind of logic somewhere once, instead of repeated a
dozen times...
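
A slightly more fleshed-out sketch of that sequence in plain POSIX C might look as follows. It is only an illustration under a few assumptions: fsync_path() and safe_rename() are hypothetical names, errors are reported via a -1 return rather than ereport(), and the last step of the pseudo-code above is taken to mean the directory containing the new name (dirname() rather than basename()).

```c
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <unistd.h>

/* fsync a file or directory given its path; returns 0 on success, -1 on error. */
int
fsync_path(const char *path, int is_dir)
{
	/* Directories can only be opened read-only; files are opened read-write. */
	int			fd = open(path, is_dir ? O_RDONLY : O_RDWR);
	int			rc;
	int			saved_errno;

	if (fd < 0)
		return -1;
	rc = fsync(fd);
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return rc;
}

/* Rename oldname to newname and try to make the result durable. */
int
safe_rename(const char *oldname, const char *newname)
{
	char		dir[4096];

	/* 1. Flush the contents of the source file. */
	if (fsync_path(oldname, 0) != 0)
		return -1;
	/* 2. Flush the target if it already exists; a missing target is fine. */
	if (fsync_path(newname, 0) != 0 && errno != ENOENT)
		return -1;
	/* 3. The rename itself: atomic, but not yet durable. */
	if (rename(oldname, newname) != 0)
		return -1;
	/* 4. Flush the file again under its new name. */
	if (fsync_path(newname, 0) != 0)
		return -1;
	/* 5. Flush the containing directory so the new entry survives a crash. */
	if (snprintf(dir, sizeof(dir), "%s", newname) >= (int) sizeof(dir))
		return -1;
	/* dirname() may modify its argument, hence the local copy. */
	return fsync_path(dirname(dir), 1);
}
```

Having a single helper like this would also settle, in one place, whether a failed directory fsync should be treated as an error at all.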

Greetings,

Andres Freund


#36Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#34)
Re: silent data loss with ext4 / all current versions

On 2016-02-01 16:49:46 +0100, Alvaro Herrera wrote:

Yeah. On 9.4 there are already some conflicts, and I'm sure there will
be more in almost each branch. Does anyone want to volunteer for
producing per-branch versions?

The next minor release is to be tagged next week and it'd be good to put
this fix there.

I don't think this is going to be ready for that. The risk of hurrying
this through seems higher than the loss risk at this point.

Andres


#37Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#36)
Re: silent data loss with ext4 / all current versions

On Tue, Feb 2, 2016 at 1:08 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-02-01 16:49:46 +0100, Alvaro Herrera wrote:

Yeah. On 9.4 there are already some conflicts, and I'm sure there will
be more in almost each branch. Does anyone want to volunteer for
producing per-branch versions?

The next minor release is to be tagged next week and it'd be good to put
this fix there.

I don't think this is going to be ready for that. The risk of hurrying
this through seems higher than the loss risk at this point.

Agreed. And there is no actual risk of data loss, so let's not hurry
and be sure that the right thing is pushed.
--
Michael


#38Michael Paquier
michael.paquier@gmail.com
In reply to: Alvaro Herrera (#34)
Re: silent data loss with ext4 / all current versions

On Tue, Feb 2, 2016 at 12:49 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Michael Paquier wrote:

On Mon, Jan 25, 2016 at 6:50 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Seems OK to me. Thanks for the time and improvements!

Thanks. Perhaps a committer could have a look then? I have switched
the patch as such in the CF app. Seeing the accumulated feedback
upthread that's something that should be backpatched.

Yeah. On 9.4 there are already some conflicts, and I'm sure there will
be more in almost each branch. Does anyone want to volunteer for
producing per-branch versions?

I don't mind doing it once we have something fully-bloomed for master,
something that I guess is very likely to apply easily to 9.5 as well.
--
Michael


#39Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#37)
Re: silent data loss with ext4 / all current versions

On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:

And there is no actual risk of data loss

Huh?

- Andres


#40Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#35)
Re: silent data loss with ext4 / all current versions

On Tue, Feb 2, 2016 at 1:07 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-01-25 16:30:47 +0900, Michael Paquier wrote:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a2846c4..b124f90 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3278,6 +3278,14 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
tmppath, path)));
return false;
}
+
+     /*
+      * Make sure the rename is permanent by fsyncing the parent
+      * directory.
+      */
+     START_CRIT_SECTION();
+     fsync_fname(XLOGDIR, true);
+     END_CRIT_SECTION();
#endif

Hm. I'm seriously doubtful that using critical sections for this is a
good idea. What's the scenario you're protecting against by declaring
this one? We intentionally don't error out in the isdir cases in
fsync_fname() anyway.

Afaik we need to fsync tmppath before and after the rename, and only
then the directory, to actually be safe.

Regarding the fsync call on the new file before the rename, would it
be better to extend fsync_fname() with some kind of noerror flag to
work around the case of a file that does not exist or do you think it
is better just to use pg_fsync() in this case after getting an fd? Using
pg_fsync() directly looks redundant with what fsync_fname() already
does.
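
For illustration, the "noerror flag" idea could look roughly like the sketch below. It is plain POSIX rather than PostgreSQL code: fsync_path() and its missing_ok parameter are hypothetical names, and it returns an error code instead of calling ereport(). Only a nonexistent path (ENOENT) is tolerated; any other failure is still reported.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * fsync_path -- hypothetical helper, not PostgreSQL's fsync_fname().
 *
 * Flushes the file or directory at "path" to disk. Returns 0 on success
 * and -1 with errno set on failure. If missing_ok is true, a path that
 * does not exist (ENOENT) counts as success; other errors are reported.
 */
int
fsync_path(const char *path, bool missing_ok)
{
    int fd;
    int rc;
    int saved_errno;

    assert(path != NULL);

    /* Assumption: a read-only descriptor is enough for fsync (true on Linux). */
    fd = open(path, O_RDONLY);
    if (fd < 0)
    {
        if (missing_ok && errno == ENOENT)
            return 0;           /* nothing to flush */
        return -1;
    }

    rc = fsync(fd);
    saved_errno = errno;
    close(fd);                  /* do not let close() clobber errno */
    errno = saved_errno;
    return rc;
}
```

A caller would pass missing_ok = true only for the rename target, which may legitimately not exist yet, and false everywhere else.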

@@ -5297,6 +5313,9 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
errmsg("could not rename file \"%s\" to \"%s\": %m",
RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));

+     /* Make sure the rename is permanent by fsyncing the data directory. */
+     fsync_fname(".", true);
+

Shouldn't RECOVERY_COMMAND_DONE be fsynced first here?

Check.

/*
@@ -6525,6 +6550,13 @@ StartupXLOG(void)
TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
}

+             /*
+              * Make sure the rename is permanent by fsyncing the parent
+              * directory.
+              */
+             if (haveBackupLabel || haveTblspcMap)
+                     fsync_fname(".", true);
+

Isn't that redundant with the haveTblspcMap case above?

I am not sure I get your point here. Are you referring to the fact
that fsync should be done after each rename in this case?

/* Check that the GUCs used to generate the WAL allow recovery */
CheckRequiredParameterValues();

@@ -7305,6 +7337,12 @@ StartupXLOG(void)
errmsg("could not rename file \"%s\" to \"%s\": %m",
origpath, partialpath)));
XLogArchiveNotify(partialfname);
+
+                             /*
+                              * Make sure the rename is permanent by fsyncing the parent
+                              * directory.
+                              */
+                             fsync_fname(XLOGDIR, true);

.partial should be fsynced first.

Check.

}
}
}
@@ -10905,6 +10943,9 @@ CancelBackup(void)
BACKUP_LABEL_FILE, BACKUP_LABEL_OLD,
TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
}
+
+     /* fsync the data directory to persist the renames */
+     fsync_fname(".", true);
}

Same.

Re-check.

/*
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 277c14a..8dda80b 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -477,6 +477,12 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
path, xlogfpath)));
/*
+      * Make sure the renames are permanent by fsyncing the parent
+      * directory.
+      */
+     fsync_fname(XLOGDIR, true);

Afaics the file under the temporary filename has not been fsynced up to
here.

Yes, true, the old file...

+ /*
* Create .done file forcibly to prevent the restored segment from being
* archived again later.
*/
@@ -586,6 +592,11 @@ XLogArchiveForceDone(const char *xlog)
errmsg("could not rename file \"%s\" to \"%s\": %m",
archiveReady, archiveDone)));

+             /*
+              * Make sure the rename is permanent by fsyncing the parent
+              * directory.
+              */
+             fsync_fname(XLOGDIR "/archive_status", true);
return;
}

Afaics the AllocateFile() call below is not protected at all, no?

Yep.

How about introducing a 'safe_rename()' that does something roughly akin
to:
fsync(oldname);
fsync(fname) || true;
rename(oldfname, fname);
fsync(fname);
fsync(basename(fname));

I'd rather have that kind of logic somewhere once, instead of repeated a
dozen times...
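
A minimal standalone sketch of that sequence, assuming plain POSIX calls. The names sync_one(), durable_rename_sketch() and SKETCH_MAXPATH are made up for this example and are not part of any patch in this thread. The last step is meant to flush the directory containing the file, so the sketch uses dirname() where the outline above writes basename(). Directory fsync errors that merely indicate the OS does not support the operation (EINVAL, EBADF) are ignored, which is an assumption borrowed from the fsync_fname() description.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_MAXPATH 1024     /* hypothetical limit for this example */

/* Flush one path; if missing_ok is nonzero, a nonexistent path is not an error. */
int
sync_one(const char *path, int missing_ok)
{
    int fd = open(path, O_RDONLY);
    int rc;
    int saved_errno;

    if (fd < 0)
        return (missing_ok && errno == ENOENT) ? 0 : -1;
    rc = fsync(fd);
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return rc;
}

/* Returns 0 on success, -1 with errno set on failure. */
int
durable_rename_sketch(const char *oldpath, const char *newpath)
{
    char dirbuf[SKETCH_MAXPATH];

    assert(oldpath != NULL && newpath != NULL);
    if (strlen(newpath) >= sizeof(dirbuf))
    {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* 1. Flush the source file's contents. */
    if (sync_one(oldpath, 0) != 0)
        return -1;
    /* 2. Flush an existing target, if any; a missing target is fine. */
    if (sync_one(newpath, 1) != 0)
        return -1;
    /* 3. The rename itself. */
    if (rename(oldpath, newpath) != 0)
        return -1;
    /* 4. Flush the file under its new name. */
    if (sync_one(newpath, 0) != 0)
        return -1;

    /* 5. Flush the parent directory; dirname() may modify its argument. */
    strcpy(dirbuf, newpath);
    if (sync_one(dirname(dirbuf), 0) != 0 && errno != EINVAL && errno != EBADF)
        return -1;
    return 0;
}
```

For a bare file name such as "backup_label", dirname() yields ".", so the current directory is flushed without special-casing an empty parent path.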

Not wrong, and this leads to the following:
void rename_safe(const char *old, const char *new, bool isdir, int elevel);
Controlling elevel is necessary because the code paths that would use it
report failures at different levels: some use ERROR, most of them FATAL,
and a few WARNING. Does that look fine?
--
Michael

#41Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#39)
Re: silent data loss with ext4 / all current versions

On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:

And there is no actual risk of data loss

Huh?

More precisely: what I mean here is that should an OS crash or a power
failure happen, we would fall back to recovery at the next restart, so we
would not actually *lose* data.
--
Michael

#42Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#40)
1 attachment(s)
Re: silent data loss with ext4 / all current versions

On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote:

Not wrong, and this leads to the following:
void rename_safe(const char *old, const char *new, bool isdir, int elevel);
Controlling elevel is necessary because the code paths that would use it
report failures at different levels: some use ERROR, most of them FATAL,
and a few WARNING. Does that look fine?

After actually coding it, I ended up with the following:
+int
+rename_safe(const char *old, const char *new)

There is no need to extend that for directories. We could of course, but
all the renames happen on files, so I see no need to make that more
complicated. More refactoring of the other rename() calls could be done
as well by extending rename_safe() with a flag to fsync the data within
a critical section, particularly for the replication slot code. I have
left that out so as not to complicate the patch further.
--
Michael

Attachments:

xlog-fsync-v4.patch (text/x-diff; charset=US-ASCII)
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index f6da673..11287aa 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -418,9 +418,10 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	TLHistoryFilePath(path, newTLI);
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing file.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Prefer link() to rename_safe() here just to be really sure that we
+	 * don't overwrite an existing file.  However, there shouldn't be one,
+	 * so rename_safe() is an acceptable substitute except for the truly
+	 * paranoid.
 	 */
 #if HAVE_WORKING_LINK
 	if (link(tmppath, path) < 0)
@@ -430,7 +431,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 						tmppath, path)));
 	unlink(tmppath);
 #else
-	if (rename(tmppath, path) < 0)
+	if (rename_safe(tmppath, path) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -508,9 +509,10 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	TLHistoryFilePath(path, tli);
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Prefer link() to rename_safe() here just to be really sure that we
+	 * don't overwrite an existing logfile.  However, there shouldn't be
+	 * one, so rename_safe() is an acceptable substitute except for the
+	 * truly paranoid.
 	 */
 #if HAVE_WORKING_LINK
 	if (link(tmppath, path) < 0)
@@ -520,7 +522,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 						tmppath, path)));
 	unlink(tmppath);
 #else
-	if (rename(tmppath, path) < 0)
+	if (rename_safe(tmppath, path) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a2846c4..b61fcc7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3251,9 +3251,10 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Prefer link() to rename_safe() here just to be really sure that we
+	 * don't overwrite an existing logfile.  However, there shouldn't be
+	 * one, so rename_safe() is an acceptable substitute except for the
+	 * truly paranoid.
 	 */
 #if HAVE_WORKING_LINK
 	if (link(tmppath, path) < 0)
@@ -3268,7 +3269,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 	unlink(tmppath);
 #else
-	if (rename(tmppath, path) < 0)
+	if (rename_safe(tmppath, path) < 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
@@ -3792,7 +3793,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 		 * flag, rename will fail. We'll try again at the next checkpoint.
 		 */
 		snprintf(newpath, MAXPGPATH, "%s.deleted", path);
-		if (rename(path, newpath) != 0)
+		if (rename_safe(path, newpath) != 0)
 		{
 			ereport(LOG,
 					(errcode_for_file_access(),
@@ -3800,10 +3801,12 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 					  path)));
 			return;
 		}
+
 		rc = unlink(newpath);
 #else
 		rc = unlink(path);
 #endif
+
 		if (rc != 0)
 		{
 			ereport(LOG,
@@ -5291,7 +5294,7 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * re-enter archive recovery mode in a subsequent crash.
 	 */
 	unlink(RECOVERY_COMMAND_DONE);
-	if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
+	if (rename_safe(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
 		ereport(FATAL,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -6138,7 +6141,7 @@ StartupXLOG(void)
 		if (stat(TABLESPACE_MAP, &st) == 0)
 		{
 			unlink(TABLESPACE_MAP_OLD);
-			if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
+			if (rename_safe(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
 				ereport(LOG,
 					(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
 							TABLESPACE_MAP, BACKUP_LABEL_FILE),
@@ -6501,7 +6504,7 @@ StartupXLOG(void)
 		if (haveBackupLabel)
 		{
 			unlink(BACKUP_LABEL_OLD);
-			if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
+			if (rename_safe(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
 				ereport(FATAL,
 						(errcode_for_file_access(),
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -6518,7 +6521,7 @@ StartupXLOG(void)
 		if (haveTblspcMap)
 		{
 			unlink(TABLESPACE_MAP_OLD);
-			if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) != 0)
+			if (rename_safe(TABLESPACE_MAP, TABLESPACE_MAP_OLD) != 0)
 				ereport(FATAL,
 						(errcode_for_file_access(),
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -7299,7 +7302,7 @@ StartupXLOG(void)
 				 */
 				XLogArchiveCleanup(partialfname);
 
-				if (rename(origpath, partialpath) != 0)
+				if (rename_safe(origpath, partialpath) != 0)
 					ereport(ERROR,
 							(errcode_for_file_access(),
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -10863,7 +10866,7 @@ CancelBackup(void)
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(BACKUP_LABEL_OLD);
 
-	if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
+	if (rename_safe(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
 	{
 		ereport(WARNING,
 				(errcode_for_file_access(),
@@ -10886,7 +10889,7 @@ CancelBackup(void)
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(TABLESPACE_MAP_OLD);
 
-	if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
+	if (rename_safe(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
 	{
 		ereport(LOG,
 				(errmsg("online backup mode canceled"),
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 277c14a..13ab7fa 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -451,7 +451,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 		 */
 		snprintf(oldpath, MAXPGPATH, "%s.deleted%u",
 				 xlogfpath, deletedcounter++);
-		if (rename(xlogfpath, oldpath) != 0)
+		if (rename_safe(xlogfpath, oldpath) != 0)
 		{
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -470,7 +470,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 		reload = true;
 	}
 
-	if (rename(path, xlogfpath) < 0)
+	if (rename_safe(path, xlogfpath) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -580,12 +580,11 @@ XLogArchiveForceDone(const char *xlog)
 	StatusFilePath(archiveReady, xlog, ".ready");
 	if (stat(archiveReady, &stat_buf) == 0)
 	{
-		if (rename(archiveReady, archiveDone) < 0)
+		if (rename_safe(archiveReady, archiveDone) < 0)
 			ereport(WARNING,
 					(errcode_for_file_access(),
 					 errmsg("could not rename file \"%s\" to \"%s\": %m",
 							archiveReady, archiveDone)));
-
 		return;
 	}
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 397f802..db54889 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -728,7 +728,7 @@ pgarch_archiveDone(char *xlog)
 
 	StatusFilePath(rlogready, xlog, ".ready");
 	StatusFilePath(rlogdone, xlog, ".done");
-	if (rename(rlogready, rlogdone) < 0)
+	if (rename_safe(rlogready, rlogdone) < 0)
 		ereport(WARNING,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 757b50e..6327972 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -617,16 +617,13 @@ CheckPointReplicationOrigin(void)
 	CloseTransientFile(tmpfd);
 
 	/* rename to permanent file, fsync file and directory */
-	if (rename(tmppath, path) != 0)
+	if (rename_safe(tmppath, path) != 0)
 	{
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						tmppath, path)));
 	}
-
-	fsync_fname((char *) path, false);
-	fsync_fname("pg_logical", true);
 }
 
 /*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ed823ec..1f17f42 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1488,8 +1488,8 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 		 * That ought to be cheap because in most scenarios it should already
 		 * be safely on disk.
 		 */
-		fsync_fname(path, false);
-		fsync_fname("pg_logical/snapshots", true);
+		fsync_fname(path, false, false);
+		fsync_fname("pg_logical/snapshots", true, false);
 
 		builder->last_serialized_snapshot = lsn;
 		goto out;
@@ -1593,7 +1593,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	CloseTransientFile(fd);
 
-	fsync_fname("pg_logical/snapshots", true);
+	fsync_fname("pg_logical/snapshots", true, false);
 
 	/*
 	 * We may overwrite the work from some other backend, but that's ok, our
@@ -1608,8 +1608,8 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	}
 
 	/* make sure we persist */
-	fsync_fname(path, false);
-	fsync_fname("pg_logical/snapshots", true);
+	fsync_fname(path, false, false);
+	fsync_fname("pg_logical/snapshots", true, false);
 
 	/*
 	 * Now there's no way we can loose the dumped state anymore, remember this
@@ -1660,8 +1660,8 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	 * either...
 	 * ----
 	 */
-	fsync_fname(path, false);
-	fsync_fname("pg_logical/snapshots", true);
+	fsync_fname(path, false, false);
+	fsync_fname("pg_logical/snapshots", true, false);
 
 
 	/* read statically sized portion of snapshot */
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 251b549..65d01e8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -453,8 +453,8 @@ ReplicationSlotDropAcquired(void)
 		 * restart.
 		 */
 		START_CRIT_SECTION();
-		fsync_fname(tmppath, true);
-		fsync_fname("pg_replslot", true);
+		fsync_fname(tmppath, true, false);
+		fsync_fname("pg_replslot", true, false);
 		END_CRIT_SECTION();
 	}
 	else
@@ -912,7 +912,7 @@ StartupReplicationSlots(void)
 						 errmsg("could not remove directory \"%s\"", path)));
 				continue;
 			}
-			fsync_fname("pg_replslot", true);
+			fsync_fname("pg_replslot", true, false);
 			continue;
 		}
 
@@ -968,7 +968,7 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 				(errcode_for_file_access(),
 				 errmsg("could not create directory \"%s\": %m",
 						tmppath)));
-	fsync_fname(tmppath, true);
+	fsync_fname(tmppath, true, false);
 
 	/* Write the actual state file. */
 	slot->dirty = true;			/* signal that we really need to write */
@@ -988,8 +988,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 	 */
 	START_CRIT_SECTION();
 
-	fsync_fname(path, true);
-	fsync_fname("pg_replslot", true);
+	fsync_fname(path, true, false);
+	fsync_fname("pg_replslot", true, false);
 
 	END_CRIT_SECTION();
 }
@@ -1094,9 +1094,9 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	/* Check CreateSlot() for the reasoning of using a crit. section. */
 	START_CRIT_SECTION();
 
-	fsync_fname(path, false);
-	fsync_fname((char *) dir, true);
-	fsync_fname("pg_replslot", true);
+	fsync_fname(path, false, false);
+	fsync_fname((char *) dir, true, false);
+	fsync_fname("pg_replslot", true, false);
 
 	END_CRIT_SECTION();
 
@@ -1165,7 +1165,7 @@ RestoreSlotFromDisk(const char *name)
 
 	/* Also sync the parent directory */
 	START_CRIT_SECTION();
-	fsync_fname(path, true);
+	fsync_fname(path, true, false);
 	END_CRIT_SECTION();
 
 	/* read part of statefile that's guaranteed to be version independent */
@@ -1248,7 +1248,7 @@ RestoreSlotFromDisk(const char *name)
 					(errcode_for_file_access(),
 					 errmsg("could not remove directory \"%s\"", path)));
 		}
-		fsync_fname("pg_replslot", true);
+		fsync_fname("pg_replslot", true, false);
 		return;
 	}
 
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 522f420..20f49ab 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -115,7 +115,7 @@ copydir(char *fromdir, char *todir, bool recurse)
 					 errmsg("could not stat file \"%s\": %m", tofile)));
 
 		if (S_ISREG(fst.st_mode))
-			fsync_fname(tofile, false);
+			fsync_fname(tofile, false, false);
 	}
 	FreeDir(xldir);
 
@@ -125,7 +125,7 @@ copydir(char *fromdir, char *todir, bool recurse)
 	 * synced. Recent versions of ext4 have made the window much wider but
 	 * it's been true for ext3 and other filesystems in the past.
 	 */
-	fsync_fname(todir, true);
+	fsync_fname(todir, true, false);
 }
 
 /*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1b30100..c0825e2 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -410,10 +410,11 @@ pg_flush_data(int fd, off_t offset, off_t amount)
  * fsync_fname -- fsync a file or directory, handling errors properly
  *
  * Try to fsync a file or directory. When doing the latter, ignore errors that
- * indicate the OS just doesn't allow/require fsyncing directories.
+ * indicate the OS just doesn't allow/require fsyncing directories. Optionally
+ * one can skip error regarding non-existing entries attempted to be fsync'ed.
  */
 void
-fsync_fname(char *fname, bool isdir)
+fsync_fname(char *fname, bool isdir, bool missing_ok)
 {
 	int			fd;
 	int			returncode;
@@ -440,9 +441,14 @@ fsync_fname(char *fname, bool isdir)
 		return;
 
 	else if (fd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m", fname)));
+	{
+		if (missing_ok)
+			return;
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", fname)));
+	}
 
 	returncode = pg_fsync(fd);
 
@@ -461,6 +467,52 @@ fsync_fname(char *fname, bool isdir)
 	CloseTransientFile(fd);
 }
 
+/*
+ * rename_safe -- rename of a file, making it on-disk persistent
+ *
+ * This routine ensures that a renamed file persists in case of a crash by
+ * fsyncing the old and new files before and after performing the rename, so
+ * that this qualifies as an all-or-nothing operation.
+ */
+int
+rename_safe(const char *old, const char *new)
+{
+	char	*parentpath;
+
+	/*
+	 * First fsync the old and new entries to ensure that they are properly
+	 * persistent on disk.
+	 */
+	fsync_fname(old, false, false);
+	fsync_fname(new, false, true);
+
+	/* Time to do the real deal... */
+	if (rename(old, new) != 0)
+		return -1;
+
+	/*
+	 * Make change persistent in case of an OS crash, both the new entry and
+	 * its parent directory need to be flushed.
+	 */
+	fsync_fname(new, false, false);
+
+	/* Same for parent directory */
+	parentpath = pstrdup(new);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument
+	 * is just a file name (see comments in path.c), so handle that as being
+	 * the current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		fsync_fname(".", true, false);
+	else
+		fsync_fname(parentpath, true, false);
+	pfree(parentpath);
+	return 0;
+}
+
 
 /*
  * InitFileAccess --- initialize this module during backend startup
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 4ad85380..a3a0a77 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -380,12 +380,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
 					 dbspacedirname, oidbuf, de->d_name + oidchars + 1 +
 					 strlen(forkNames[INIT_FORKNUM]));
 
-			fsync_fname(mainpath, false);
+			fsync_fname(mainpath, false, false);
 		}
 
 		FreeDir(dbspace_dir);
 
-		fsync_fname((char *) dbspacedirname, true);
+		fsync_fname((char *) dbspacedirname, true, false);
 	}
 }
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 4a3fccb..24f04f9 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -113,7 +113,8 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern int	pg_flush_data(int fd, off_t offset, off_t amount);
-extern void fsync_fname(char *fname, bool isdir);
+extern void fsync_fname(char *fname, bool isdir, bool missing_ok);
+extern int	rename_safe(const char *old, const char *new);
 extern void SyncDataDirectory(void);
 
 /* Filename components for OpenTemporaryFile */
#43Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#42)
Re: silent data loss with ext4 / all current versions

On Thu, Feb 4, 2016 at 12:02 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote:

Not wrong, and this leads to the following:
void rename_safe(const char *old, const char *new, bool isdir, int elevel);
Controlling elevel is necessary because the code paths that would use it
report failures at different levels: some use ERROR, most of them FATAL,
and a few WARNING. Does that look fine?

After actually coding it, I ended up with the following:
+int
+rename_safe(const char *old, const char *new)

There is no need to extend that for directories. We could of course, but
all the renames happen on files, so I see no need to make that more
complicated. More refactoring of the other rename() calls could be done
as well by extending rename_safe() with a flag to fsync the data within
a critical section, particularly for the replication slot code. I have
left that out so as not to complicate the patch further.

Andres just poked me (we are sitting 2m from each other right now) about
the fact that we should fsync even after the link() calls when
HAVE_WORKING_LINK is used. Without that we could lose some metadata
here. Mea culpa: the patch misses that.
--
Michael

#44Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#43)
1 attachment(s)
Re: silent data loss with ext4 / all current versions

On Thu, Feb 4, 2016 at 2:34 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Feb 4, 2016 at 12:02 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote:

Not wrong, and this leads to the following:
void rename_safe(const char *old, const char *new, bool isdir, int elevel);
Controlling elevel is necessary because the code paths that would use it
report failures at different levels: some use ERROR, most of them FATAL,
and a few WARNING. Does that look fine?

After actually coding it, I ended up with the following:
+int
+rename_safe(const char *old, const char *new)

There is no need to extend that for directories. We could of course, but
all the renames happen on files, so I see no need to make that more
complicated. More refactoring of the other rename() calls could be done
as well by extending rename_safe() with a flag to fsync the data within
a critical section, particularly for the replication slot code. I have
left that out so as not to complicate the patch further.

Andres just poked me (we are sitting 2m from each other right now) about
the fact that we should fsync even after the link() calls when
HAVE_WORKING_LINK is used. Without that we could lose some metadata
here. Mea culpa: the patch misses that.

So, attached is an updated patch that adds a new routine, link_safe(),
to ensure that a hard link is persistent on disk. link() is called
twice in timeline.c and once in xlog.c, so those three code paths are
affected. I also noticed that my previous patch sometimes made palloc
calls in a critical section (oops); I fixed that along the way.
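
As a rough idea of what a durable hard link involves, here is a standalone POSIX sketch. It is not the link_safe() from the attached patch: link_durably_sketch(), flush_path() and SKETCH_MAXPATH are hypothetical names, and the exact set of fsync calls is an assumption modeled on the rename sequence discussed upthread. The parent path is built in a stack buffer, so nothing is allocated on the heap.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_MAXPATH 1024     /* hypothetical limit for this example */

/* Flush one existing file or directory; returns 0 or -1 with errno set. */
int
flush_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;
    int saved_errno;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return rc;
}

/* Returns 0 on success, -1 with errno set on failure. */
int
link_durably_sketch(const char *oldpath, const char *newpath)
{
    char dirbuf[SKETCH_MAXPATH];

    assert(oldpath != NULL && newpath != NULL);
    if (strlen(newpath) >= sizeof(dirbuf))
    {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* Make sure the contents are on disk before they get a second name. */
    if (flush_path(oldpath) != 0)
        return -1;

    /* Unlike rename(), link() fails with EEXIST instead of overwriting. */
    if (link(oldpath, newpath) != 0)
        return -1;

    /* Flush the file again through its new name. */
    if (flush_path(newpath) != 0)
        return -1;

    /* Flush the directory entry; tolerate "directory fsync unsupported". */
    strcpy(dirbuf, newpath);
    if (flush_path(dirname(dirbuf)) != 0 && errno != EINVAL && errno != EBADF)
        return -1;
    return 0;
}
```

The caller would still unlink() the temporary name afterwards, as the HAVE_WORKING_LINK code paths do.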

Thoughts welcome.
--
Michael

Attachments:

xlog-fsync-v5.patch (binary/octet-stream)
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index f6da673..67e0f73 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -418,19 +418,20 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	TLHistoryFilePath(path, newTLI);
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing file.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Prefer link_safe() to rename_safe() here just to be really sure that
+	 * we don't overwrite an existing file.  However, there shouldn't be one,
+	 * so rename_safe() is an acceptable substitute except for the truly
+	 * paranoid.
 	 */
 #if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
+	if (link_safe(tmppath, path) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not link file \"%s\" to \"%s\": %m",
 						tmppath, path)));
 	unlink(tmppath);
 #else
-	if (rename(tmppath, path) < 0)
+	if (rename_safe(tmppath, path) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -508,19 +509,20 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	TLHistoryFilePath(path, tli);
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Prefer link_safe() to rename_safe() here just to be really sure that
+	 * we don't overwrite an existing logfile.  However, there shouldn't be
+	 * one, so rename_safe() is an acceptable substitute except for the
+	 * truly paranoid.
 	 */
 #if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
+	if (link_safe(tmppath, path) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not link file \"%s\" to \"%s\": %m",
 						tmppath, path)));
 	unlink(tmppath);
 #else
-	if (rename(tmppath, path) < 0)
+	if (rename_safe(tmppath, path) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a2846c4..fc85c81 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3251,12 +3251,13 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Prefer link_safe() to rename_safe() here just to be really sure
+	 * that we don't overwrite an existing logfile.  However, there
+	 * shouldn't be one, so rename_safe() is an acceptable substitute
+	 * except for the truly paranoid.
 	 */
 #if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
+	if (link_safe(tmppath, path) < 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
@@ -3268,7 +3269,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 	unlink(tmppath);
 #else
-	if (rename(tmppath, path) < 0)
+	if (rename_safe(tmppath, path) < 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
@@ -3792,7 +3793,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 		 * flag, rename will fail. We'll try again at the next checkpoint.
 		 */
 		snprintf(newpath, MAXPGPATH, "%s.deleted", path);
-		if (rename(path, newpath) != 0)
+		if (rename_safe(path, newpath) != 0)
 		{
 			ereport(LOG,
 					(errcode_for_file_access(),
@@ -3800,10 +3801,12 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 					  path)));
 			return;
 		}
+
 		rc = unlink(newpath);
 #else
 		rc = unlink(path);
 #endif
+
 		if (rc != 0)
 		{
 			ereport(LOG,
@@ -5291,7 +5294,7 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * re-enter archive recovery mode in a subsequent crash.
 	 */
 	unlink(RECOVERY_COMMAND_DONE);
-	if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
+	if (rename_safe(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
 		ereport(FATAL,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -6138,7 +6141,7 @@ StartupXLOG(void)
 		if (stat(TABLESPACE_MAP, &st) == 0)
 		{
 			unlink(TABLESPACE_MAP_OLD);
-			if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
+			if (rename_safe(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
 				ereport(LOG,
 					(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
 							TABLESPACE_MAP, BACKUP_LABEL_FILE),
@@ -6501,7 +6504,7 @@ StartupXLOG(void)
 		if (haveBackupLabel)
 		{
 			unlink(BACKUP_LABEL_OLD);
-			if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
+			if (rename_safe(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
 				ereport(FATAL,
 						(errcode_for_file_access(),
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -6518,7 +6521,7 @@ StartupXLOG(void)
 		if (haveTblspcMap)
 		{
 			unlink(TABLESPACE_MAP_OLD);
-			if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) != 0)
+			if (rename_safe(TABLESPACE_MAP, TABLESPACE_MAP_OLD) != 0)
 				ereport(FATAL,
 						(errcode_for_file_access(),
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -7299,7 +7302,7 @@ StartupXLOG(void)
 				 */
 				XLogArchiveCleanup(partialfname);
 
-				if (rename(origpath, partialpath) != 0)
+				if (rename_safe(origpath, partialpath) != 0)
 					ereport(ERROR,
 							(errcode_for_file_access(),
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -10863,7 +10866,7 @@ CancelBackup(void)
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(BACKUP_LABEL_OLD);
 
-	if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
+	if (rename_safe(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
 	{
 		ereport(WARNING,
 				(errcode_for_file_access(),
@@ -10886,7 +10889,7 @@ CancelBackup(void)
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(TABLESPACE_MAP_OLD);
 
-	if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
+	if (rename_safe(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
 	{
 		ereport(LOG,
 				(errmsg("online backup mode canceled"),
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 277c14a..13ab7fa 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -451,7 +451,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 		 */
 		snprintf(oldpath, MAXPGPATH, "%s.deleted%u",
 				 xlogfpath, deletedcounter++);
-		if (rename(xlogfpath, oldpath) != 0)
+		if (rename_safe(xlogfpath, oldpath) != 0)
 		{
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -470,7 +470,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 		reload = true;
 	}
 
-	if (rename(path, xlogfpath) < 0)
+	if (rename_safe(path, xlogfpath) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -580,12 +580,11 @@ XLogArchiveForceDone(const char *xlog)
 	StatusFilePath(archiveReady, xlog, ".ready");
 	if (stat(archiveReady, &stat_buf) == 0)
 	{
-		if (rename(archiveReady, archiveDone) < 0)
+		if (rename_safe(archiveReady, archiveDone) < 0)
 			ereport(WARNING,
 					(errcode_for_file_access(),
 					 errmsg("could not rename file \"%s\" to \"%s\": %m",
 							archiveReady, archiveDone)));
-
 		return;
 	}
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 397f802..db54889 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -728,7 +728,7 @@ pgarch_archiveDone(char *xlog)
 
 	StatusFilePath(rlogready, xlog, ".ready");
 	StatusFilePath(rlogdone, xlog, ".done");
-	if (rename(rlogready, rlogdone) < 0)
+	if (rename_safe(rlogready, rlogdone) < 0)
 		ereport(WARNING,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 5af47ec..e1497a9 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -617,16 +617,13 @@ CheckPointReplicationOrigin(void)
 	CloseTransientFile(tmpfd);
 
 	/* rename to permanent file, fsync file and directory */
-	if (rename(tmppath, path) != 0)
+	if (rename_safe((char *) tmppath, (char *) path) != 0)
 	{
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						tmppath, path)));
 	}
-
-	fsync_fname((char *) path, false);
-	fsync_fname("pg_logical", true);
 }
 
 /*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ed823ec..1f17f42 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1488,8 +1488,8 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 		 * That ought to be cheap because in most scenarios it should already
 		 * be safely on disk.
 		 */
-		fsync_fname(path, false);
-		fsync_fname("pg_logical/snapshots", true);
+		fsync_fname(path, false, false);
+		fsync_fname("pg_logical/snapshots", true, false);
 
 		builder->last_serialized_snapshot = lsn;
 		goto out;
@@ -1593,7 +1593,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	}
 	CloseTransientFile(fd);
 
-	fsync_fname("pg_logical/snapshots", true);
+	fsync_fname("pg_logical/snapshots", true, false);
 
 	/*
 	 * We may overwrite the work from some other backend, but that's ok, our
@@ -1608,8 +1608,8 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
 	}
 
 	/* make sure we persist */
-	fsync_fname(path, false);
-	fsync_fname("pg_logical/snapshots", true);
+	fsync_fname(path, false, false);
+	fsync_fname("pg_logical/snapshots", true, false);
 
 	/*
 	 * Now there's no way we can loose the dumped state anymore, remember this
@@ -1660,8 +1660,8 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	 * either...
 	 * ----
 	 */
-	fsync_fname(path, false);
-	fsync_fname("pg_logical/snapshots", true);
+	fsync_fname(path, false, false);
+	fsync_fname("pg_logical/snapshots", true, false);
 
 
 	/* read statically sized portion of snapshot */
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 251b549..65d01e8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -453,8 +453,8 @@ ReplicationSlotDropAcquired(void)
 		 * restart.
 		 */
 		START_CRIT_SECTION();
-		fsync_fname(tmppath, true);
-		fsync_fname("pg_replslot", true);
+		fsync_fname(tmppath, true, false);
+		fsync_fname("pg_replslot", true, false);
 		END_CRIT_SECTION();
 	}
 	else
@@ -912,7 +912,7 @@ StartupReplicationSlots(void)
 						 errmsg("could not remove directory \"%s\"", path)));
 				continue;
 			}
-			fsync_fname("pg_replslot", true);
+			fsync_fname("pg_replslot", true, false);
 			continue;
 		}
 
@@ -968,7 +968,7 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 				(errcode_for_file_access(),
 				 errmsg("could not create directory \"%s\": %m",
 						tmppath)));
-	fsync_fname(tmppath, true);
+	fsync_fname(tmppath, true, false);
 
 	/* Write the actual state file. */
 	slot->dirty = true;			/* signal that we really need to write */
@@ -988,8 +988,8 @@ CreateSlotOnDisk(ReplicationSlot *slot)
 	 */
 	START_CRIT_SECTION();
 
-	fsync_fname(path, true);
-	fsync_fname("pg_replslot", true);
+	fsync_fname(path, true, false);
+	fsync_fname("pg_replslot", true, false);
 
 	END_CRIT_SECTION();
 }
@@ -1094,9 +1094,9 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	/* Check CreateSlot() for the reasoning of using a crit. section. */
 	START_CRIT_SECTION();
 
-	fsync_fname(path, false);
-	fsync_fname((char *) dir, true);
-	fsync_fname("pg_replslot", true);
+	fsync_fname(path, false, false);
+	fsync_fname((char *) dir, true, false);
+	fsync_fname("pg_replslot", true, false);
 
 	END_CRIT_SECTION();
 
@@ -1165,7 +1165,7 @@ RestoreSlotFromDisk(const char *name)
 
 	/* Also sync the parent directory */
 	START_CRIT_SECTION();
-	fsync_fname(path, true);
+	fsync_fname(path, true, false);
 	END_CRIT_SECTION();
 
 	/* read part of statefile that's guaranteed to be version independent */
@@ -1248,7 +1248,7 @@ RestoreSlotFromDisk(const char *name)
 					(errcode_for_file_access(),
 					 errmsg("could not remove directory \"%s\"", path)));
 		}
-		fsync_fname("pg_replslot", true);
+		fsync_fname("pg_replslot", true, false);
 		return;
 	}
 
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 522f420..20f49ab 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -115,7 +115,7 @@ copydir(char *fromdir, char *todir, bool recurse)
 					 errmsg("could not stat file \"%s\": %m", tofile)));
 
 		if (S_ISREG(fst.st_mode))
-			fsync_fname(tofile, false);
+			fsync_fname(tofile, false, false);
 	}
 	FreeDir(xldir);
 
@@ -125,7 +125,7 @@ copydir(char *fromdir, char *todir, bool recurse)
 	 * synced. Recent versions of ext4 have made the window much wider but
 	 * it's been true for ext3 and other filesystems in the past.
 	 */
-	fsync_fname(todir, true);
+	fsync_fname(todir, true, false);
 }
 
 /*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1b30100..bd268bd 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -307,6 +307,7 @@ static void walkdir(const char *path,
 static void pre_sync_fname(const char *fname, bool isdir, int elevel);
 #endif
 static void fsync_fname_ext(const char *fname, bool isdir, int elevel);
+static void fsync_parent_path(const char *fname);
 
 
 /*
@@ -410,10 +411,11 @@ pg_flush_data(int fd, off_t offset, off_t amount)
  * fsync_fname -- fsync a file or directory, handling errors properly
  *
  * Try to fsync a file or directory. When doing the latter, ignore errors that
- * indicate the OS just doesn't allow/require fsyncing directories.
+ * indicate the OS just doesn't allow/require fsyncing directories. Optionally
+ * one can skip error regarding non-existing entries attempted to be fsync'ed.
  */
 void
-fsync_fname(char *fname, bool isdir)
+fsync_fname(char *fname, bool isdir, bool missing_ok)
 {
 	int			fd;
 	int			returncode;
@@ -440,9 +442,14 @@ fsync_fname(char *fname, bool isdir)
 		return;
 
 	else if (fd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m", fname)));
+	{
+		if (missing_ok)
+			return;
+		else
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", fname)));
+	}
 
 	returncode = pg_fsync(fd);
 
@@ -461,6 +468,62 @@ fsync_fname(char *fname, bool isdir)
 	CloseTransientFile(fd);
 }
 
+/*
+ * rename_safe -- rename of a file, making it on-disk persistent
+ *
+ * This routine ensures that a renamed file persists in case of a crash by using
+ * fsync on the old and new files before and after performing the rename so as
+ * this categorizes as an all-or-nothing operation.
+ */
+int
+rename_safe(char *oldfile, char *newfile)
+{
+	/*
+	 * First fsync the old and new entries to ensure that they are properly
+	 * persistent on disk.
+	 */
+	fsync_fname(oldfile, false, false);
+	fsync_fname(newfile, false, true);
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+		return -1;
+
+	/*
+	 * Make change persistent in case of an OS crash, both the new entry and
+	 * its parent directory need to be flushed.
+	 */
+	fsync_fname(newfile, false, false);
+
+	/* Same for parent directory */
+	fsync_parent_path(newfile);
+	return 0;
+}
+
+/*
+ * link_safe -- make a file hard link, making it on-disk persistent
+ *
+ * This routine ensures that a hard link created on a file persists on system
+ * in case of a crash by using fsync on the link generated as well as on
+ * its parent directory.
+ */
+int
+link_safe(char *oldfile, char *newfile)
+{
+	if (link(oldfile, newfile) < 0)
+		return -1;
+
+	/*
+	 * Make the link persistent in case of an OS crash, the new entry
+	 * generated as well as its parent directory need to be flushed.
+	 */
+	fsync_fname(newfile, false, false);
+
+	/* Same for parent directory */
+	fsync_parent_path(newfile);
+	return 0;
+}
+
 
 /*
  * InitFileAccess --- initialize this module during backend startup
@@ -2719,3 +2782,29 @@ fsync_fname_ext(const char *fname, bool isdir, int elevel)
 
 	(void) CloseTransientFile(fd);
 }
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+static void
+fsync_parent_path(const char *fname)
+{
+	char	parentpath[MAXPGPATH];
+
+	/* Same for parent directory */
+	snprintf(parentpath, MAXPGPATH, "%s", fname);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument
+	 * is just a file name (see comments in path.c), so handle that as being
+	 * the current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		fsync_fname(".", true, false);
+	else
+		fsync_fname(parentpath, true, false);
+}
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 4ad85380..a3a0a77 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -380,12 +380,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
 					 dbspacedirname, oidbuf, de->d_name + oidchars + 1 +
 					 strlen(forkNames[INIT_FORKNUM]));
 
-			fsync_fname(mainpath, false);
+			fsync_fname(mainpath, false, false);
 		}
 
 		FreeDir(dbspace_dir);
 
-		fsync_fname((char *) dbspacedirname, true);
+		fsync_fname((char *) dbspacedirname, true, false);
 	}
 }
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 4a3fccb..8ad5207 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -113,7 +113,9 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern int	pg_flush_data(int fd, off_t offset, off_t amount);
-extern void fsync_fname(char *fname, bool isdir);
+extern void fsync_fname(char *fname, bool isdir, bool missing_ok);
+extern int	rename_safe(char *oldfile, char *newfile);
+extern int	link_safe(char *oldfile, char *newfile);
 extern void SyncDataDirectory(void);
 
 /* Filename components for OpenTemporaryFile */
#45Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#41)
Re: silent data loss with ext4 / all current versions

On 02/04/2016 09:59 AM, Michael Paquier wrote:

On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:

And there is no actual risk of data loss

Huh?

More precise: what I mean here is that should an OS crash or a power
failure happen, we would fall back to recovery at next restart, so we
would not actually *lose* data.

Except that we actually can't perform the recovery properly because we
may not have the last WAL segment (or multiple segments), so we can't
replay the last batch of transactions. And we don't even notice that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#46Michael Paquier
michael.paquier@gmail.com
In reply to: Tomas Vondra (#45)
Re: silent data loss with ext4 / all current versions

On Sat, Feb 6, 2016 at 2:11 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 02/04/2016 09:59 AM, Michael Paquier wrote:

On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:

And there is no actual risk of data loss

Huh?

More precise: what I mean here is that should an OS crash or a power
failure happen, we would fall back to recovery at next restart, so we
would not actually *lose* data.

Except that we actually can't perform the recovery properly because we may
not have the last WAL segment (or multiple segments), so we can't replay the
last batch of transactions. And we don't even notice that.

Still the data is here... But well. I won't insist. Tomas, could you
have a look at the latest patch I wrote? It would be good to get fresh
eyes on it. We could work on a version for ~9.4 once we have a clean
approach for master/9.5.
--
Michael


#47Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#46)
Re: silent data loss with ext4 / all current versions

Hi,

On 02/06/2016 01:16 PM, Michael Paquier wrote:

On Sat, Feb 6, 2016 at 2:11 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On 02/04/2016 09:59 AM, Michael Paquier wrote:

On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:

And there is no actual risk of data loss

Huh?

More precise: what I mean here is that should an OS crash or a
power failure happen, we would fall back to recovery at next
restart, so we would not actually *lose* data.

Except that we actually can't perform the recovery properly
because we may not have the last WAL segment (or multiple
segments), so we can't replay the last batch of transactions. And
we don't even notice that.

Still the data is here... But well. I won't insist.

Huh? This thread started with an example of how to cause loss of committed
transactions. That fits my definition of "data loss" quite well.

Tomas, could you have a look at the latest patch I wrote? It would be
good to get fresh eyes on it. We could work on a version for ~9.4
once we have a clean approach for master/9.5.

Yep, I'll take a look - I've been out of office for the past 2 weeks,
but I've been following the discussion and I agree with the changes
discussed there (e.g. adding rename_safe and such).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#48Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#47)
Re: silent data loss with ext4 / all current versions

On 2016-02-06 17:43:48 +0100, Tomas Vondra wrote:

Still the data is here... But well. I won't insist.

Huh? This thread started with an example of how to cause loss of committed
transactions. That fits my definition of "data loss" quite well.

Agreed, that view doesn't seem to make much sense. This clearly is a
data loss issue.

Andres


#49Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#44)
Re: silent data loss with ext4 / all current versions

Hi,

On 02/05/2016 10:40 AM, Michael Paquier wrote:

On Thu, Feb 4, 2016 at 2:34 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Feb 4, 2016 at 12:02 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Feb 2, 2016 at 4:20 PM, Michael Paquier wrote:

...

So, attached is an updated patch that adds a new routine link_safe()
to ensure that a hard link is on-disk persistent. link() is called
twice in timeline.c and once in xlog.c, so those three code paths
are impacted. I noticed as well that my previous patch was sometimes
doing palloc calls in a critical section (oops), I fixed that on the
way.

I've finally gotten around to reviewing v5 of the patch. Sorry it
took so long (I blame FOSDEM, a country-wide flu epidemic, and my general
laziness).

I do like most of the changes to the patch, thanks for improving it. A
few comments though:

1) I'm not quite sure why the patch adds missing_ok to fsync_fname()?
The only place where we use missing_ok=true is in rename_safe, where
right at the beginning we do this:

fsync_fname(newfile, false, true);

I.e. we're fsyncing the rename target first, in case it exists. But that
seems to be conflicting with the comments in xlog.c where we explicitly
state that the target file should not exist. Why should it be OK to call
rename_safe() when the target already exists? If this really is the
right thing to do, it should be explained in the comment above
rename_safe(), probably.

2) If rename_safe/link_safe are meant as crash-safe replacements for
rename/link, then perhaps we should use the same signatures, including
the "const" pointer parameters. So while currently the signatures look
like this:

int rename_safe(char *oldfile, char *newfile);
int link_safe(char *oldfile, char *newfile);

it should probably look like this

int rename_safe(const char *oldfile, const char *newfile);
int link_safe(const char *oldfile, const char *newfile);

I've noticed this in CheckPointReplicationOrigin() where the current
code has to cast the parameters to (char*) to silence the compiler.

3) Both rename_safe and link_safe do this at the very end:

fsync_parent_path(newfile);

That however assumes both the oldfile and newfile are placed in the same
directory - otherwise we'd fsync only one of the two parent directories.
I don't think we have a place where we're renaming files between
directories (or do we?), so we're OK with respect to this. But it seems
like a good idea to defend against this, or at least mention that in the
comments.

4) nitpicking: There are some unnecessary newlines added/removed in
RemoveXlogFile, XLogArchiveForceDone.
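
For illustration only, the sequence reviewed in points 1-3 can be sketched with plain POSIX calls. This is a minimal sketch and not the patch's code: the names sketch_fsync_path and sketch_rename_durable and the SKETCH_MAXPATH limit are made up here, errors are reported through return codes rather than ereport(), an already existing target is not fsynced beforehand, and it assumes (as point 3 notes) that both names live in the same directory.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_MAXPATH 1024		/* hypothetical stand-in for MAXPGPATH */

/* Hypothetical helper: fsync a file or directory by path; 0 on success. */
static int
sketch_fsync_path(const char *path, int isdir)
{
	int			fd;
	int			rc;

	fd = open(path, isdir ? O_RDONLY : O_RDWR);
	if (fd < 0)
		return -1;

	rc = fsync(fd);

	/* Some systems refuse to fsync directories; treat that as harmless. */
	if (rc != 0 && isdir && (errno == EBADF || errno == EINVAL))
		rc = 0;

	if (close(fd) != 0)
		rc = -1;
	return rc;
}

/* Hypothetical rename() replacement with the same const signature. */
static int
sketch_rename_durable(const char *oldfile, const char *newfile)
{
	char		buf[SKETCH_MAXPATH];

	assert(oldfile != NULL && newfile != NULL);

	/* Flush the source file before it becomes visible under the new name. */
	if (sketch_fsync_path(oldfile, 0) != 0)
		return -1;

	if (rename(oldfile, newfile) != 0)
		return -1;

	/* Flush the file under its new name. */
	if (sketch_fsync_path(newfile, 0) != 0)
		return -1;

	/* dirname() may modify its argument, so work on a copy. */
	if (strlen(newfile) >= sizeof(buf))
	{
		errno = ENAMETOOLONG;
		return -1;
	}
	strcpy(buf, newfile);

	/* Flush the directory entry; assumes old and new share this parent. */
	return sketch_fsync_path(dirname(buf), 1);
}
```

Because both parameters are const pointers, a caller holding const paths can use it exactly like rename() without any casts.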

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#50Michael Paquier
michael.paquier@gmail.com
In reply to: Tomas Vondra (#49)
1 attachment(s)
Re: silent data loss with ext4 / all current versions

On Wed, Feb 24, 2016 at 7:26 AM, Tomas Vondra wrote:

1) I'm not quite sure why the patch adds missing_ok to fsync_fname()? The
only place where we use missing_ok=true is in rename_safe, where right at
the beginning we do this:

fsync_fname(newfile, false, true);

I.e. we're fsyncing the rename target first, in case it exists. But that
seems to be conflicting with the comments in xlog.c where we explicitly
state that the target file should not exist. Why should it be OK to call
rename_safe() when the target already exists? If this really is the right
thing to do, it should be explained in the comment above rename_safe(),
probably.

The point is to mimic rename(), which handles the case where the new
entry already exists, and to get consistent on-disk behavior. It is
true that the new argument of fsync_fname is not strictly necessary:
we could just check for the existence of the new entry with stat(),
and perform an fsync if it exists.
I have added the following comment in rename_safe():
+   /*
+    * First fsync the old entry and new entry, if this one exists, to ensure
+    * that they are properly persistent on disk. Calling this routine with
+    * an existing new target file is fine, rename() will first remove it
+    * before performing its operation.
+    */
How does that look?
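
The stat()-based alternative mentioned above might look roughly like the following. This is only a sketch under assumptions: fsync_if_exists is a hypothetical name, it uses raw POSIX calls rather than OpenTransientFile() and pg_fsync(), and it reports the outcome through its return value instead of ereport().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: fsync "path" only if it exists.
 *
 * Returns 1 if the file existed and was flushed, 0 if there is no such
 * file, and -1 on any other error.
 */
static int
fsync_if_exists(const char *path)
{
	struct stat st;
	int			fd;
	int			rc;

	assert(path != NULL);

	/* A missing target is not an error for this caller. */
	if (stat(path, &st) != 0)
		return (errno == ENOENT) ? 0 : -1;

	fd = open(path, O_RDWR);
	if (fd < 0)
	{
		/* The file may have been removed between stat() and open(). */
		return (errno == ENOENT) ? 0 : -1;
	}

	rc = fsync(fd);
	if (close(fd) != 0)
		rc = -1;

	return (rc == 0) ? 1 : -1;
}
```

Since open() can itself report ENOENT, the stat() call is not strictly required; it is kept here only to mirror the wording above.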

2) If rename_safe/link_safe are meant as crash-safe replacements for
rename/link, then perhaps we should use the same signatures, including the
"const" pointer parameters. So while currently the signatures look like
this:

int rename_safe(char *oldfile, char *newfile);
int link_safe(char *oldfile, char *newfile);

it should probably look like this

int rename_safe(const char *oldfile, const char *newfile);
int link_safe(const char *oldfile, const char *newfile);

I've noticed this in CheckPointReplicationOrigin() where the current code
has to cast the parameters to (char*) to silence the compiler.

I recall considering that, and the reason why I did not do so was that
fsync_fname() is not doing it either, because OpenTransientFile() is
using FileName which is not defined as a constant. At the end we'd
finish with one or two casts anyway. Still, changing fsync_fname() does
seem to make sense: looking at fsync_fname_ext(), OpenTransientFile() is
already called there with an explicit cast for the file name.

3) Both rename_safe and link_safe do this at the very end:

fsync_parent_path(newfile);

That however assumes both the oldfile and newfile are placed in the same
directory - otherwise we'd fsync only one of the two parent directories. I
don't think we have a place where we're renaming files between directories
(or do we?), so we're OK with respect to this. But it seems like a good idea
to defend against this, or at least mention that in the comments.

No, we don't have a code path where a file is renamed between
different directories, which is why this code only fsyncs the parent of
the new entry. The comment is a good addition to have though, so I have
added it. I guess that we could complicate this patch further to check
whether the parent directories of the new and old entries match, and
fsync both of them if they don't, but I'd rather keep things simple.
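
For reference, the parent-directory check described above could be sketched as below. It is not part of the patch: parent_dir_of and same_parent_dir are hypothetical names, SKETCH_MAXPATH stands in for MAXPGPATH, and the comparison is purely textual, so symlinks and ".." components are not resolved.

```c
#include <assert.h>
#include <libgen.h>
#include <string.h>

#define SKETCH_MAXPATH 1024		/* hypothetical stand-in for MAXPGPATH */

/* Copy the parent directory of "path" into "out"; 0 on success, -1 if too long. */
static int
parent_dir_of(const char *path, char *out, size_t outlen)
{
	char		tmp[SKETCH_MAXPATH];
	const char *dir;

	assert(path != NULL && out != NULL);

	if (strlen(path) >= sizeof(tmp))
		return -1;
	strcpy(tmp, path);

	/* dirname() may modify tmp or return static storage, so copy at once. */
	dir = dirname(tmp);
	if (strlen(dir) >= outlen)
		return -1;
	strcpy(out, dir);
	return 0;
}

/* Return 1 if both paths share a parent directory, 0 if not, -1 on error. */
static int
same_parent_dir(const char *oldfile, const char *newfile)
{
	char		oldparent[SKETCH_MAXPATH];
	char		newparent[SKETCH_MAXPATH];

	if (parent_dir_of(oldfile, oldparent, sizeof(oldparent)) != 0 ||
		parent_dir_of(newfile, newparent, sizeof(newparent)) != 0)
		return -1;

	return strcmp(oldparent, newparent) == 0 ? 1 : 0;
}
```

A caller that gets 0 back would fsync both parent directories after the rename; with 1, flushing the parent of the new entry is enough.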

4) nitpicking: There are some unnecessary newlines added/removed in
RemoveXlogFile, XLogArchiveForceDone.

Fixed. I missed those three.
--
Michael

Attachments:

xlog-fsync-v6.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index f6da673..67e0f73 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -418,19 +418,20 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	TLHistoryFilePath(path, newTLI);
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing file.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Prefer link_safe() to rename_safe() here just to be really sure that
+	 * we don't overwrite an existing file.  However, there shouldn't be one,
+	 * so rename_safe() is an acceptable substitute except for the truly
+	 * paranoid.
 	 */
 #if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
+	if (link_safe(tmppath, path) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not link file \"%s\" to \"%s\": %m",
 						tmppath, path)));
 	unlink(tmppath);
 #else
-	if (rename(tmppath, path) < 0)
+	if (rename_safe(tmppath, path) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -508,19 +509,20 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	TLHistoryFilePath(path, tli);
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Prefer link_safe() to rename_safe() here just to be really sure that
+	 * we don't overwrite an existing logfile.  However, there shouldn't be
+	 * one, so rename_safe() is an acceptable substitute except for the
+	 * truly paranoid.
 	 */
 #if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
+	if (link_safe(tmppath, path) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not link file \"%s\" to \"%s\": %m",
 						tmppath, path)));
 	unlink(tmppath);
 #else
-	if (rename(tmppath, path) < 0)
+	if (rename_safe(tmppath, path) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 94b79ac..93b6896 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3299,12 +3299,13 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Prefer link_safe() to rename_safe() here just to be really sure
+	 * that we don't overwrite an existing logfile.  However, there
+	 * shouldn't be one, so rename_safe() is an acceptable substitute
+	 * except for the truly paranoid.
 	 */
 #if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
+	if (link_safe(tmppath, path) < 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
@@ -3316,7 +3317,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 	unlink(tmppath);
 #else
-	if (rename(tmppath, path) < 0)
+	if (rename_safe(tmppath, path) < 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
@@ -3840,7 +3841,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 		 * flag, rename will fail. We'll try again at the next checkpoint.
 		 */
 		snprintf(newpath, MAXPGPATH, "%s.deleted", path);
-		if (rename(path, newpath) != 0)
+		if (rename_safe(path, newpath) != 0)
 		{
 			ereport(LOG,
 					(errcode_for_file_access(),
@@ -5339,7 +5340,7 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * re-enter archive recovery mode in a subsequent crash.
 	 */
 	unlink(RECOVERY_COMMAND_DONE);
-	if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
+	if (rename_safe(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
 		ereport(FATAL,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -6186,7 +6187,7 @@ StartupXLOG(void)
 		if (stat(TABLESPACE_MAP, &st) == 0)
 		{
 			unlink(TABLESPACE_MAP_OLD);
-			if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
+			if (rename_safe(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
 				ereport(LOG,
 					(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
 							TABLESPACE_MAP, BACKUP_LABEL_FILE),
@@ -6549,7 +6550,7 @@ StartupXLOG(void)
 		if (haveBackupLabel)
 		{
 			unlink(BACKUP_LABEL_OLD);
-			if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
+			if (rename_safe(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
 				ereport(FATAL,
 						(errcode_for_file_access(),
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -6566,7 +6567,7 @@ StartupXLOG(void)
 		if (haveTblspcMap)
 		{
 			unlink(TABLESPACE_MAP_OLD);
-			if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) != 0)
+			if (rename_safe(TABLESPACE_MAP, TABLESPACE_MAP_OLD) != 0)
 				ereport(FATAL,
 						(errcode_for_file_access(),
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -7347,7 +7348,7 @@ StartupXLOG(void)
 				 */
 				XLogArchiveCleanup(partialfname);
 
-				if (rename(origpath, partialpath) != 0)
+				if (rename_safe(origpath, partialpath) != 0)
 					ereport(ERROR,
 							(errcode_for_file_access(),
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -10907,7 +10908,7 @@ CancelBackup(void)
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(BACKUP_LABEL_OLD);
 
-	if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
+	if (rename_safe(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
 	{
 		ereport(WARNING,
 				(errcode_for_file_access(),
@@ -10930,7 +10931,7 @@ CancelBackup(void)
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(TABLESPACE_MAP_OLD);
 
-	if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
+	if (rename_safe(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
 	{
 		ereport(LOG,
 				(errmsg("online backup mode canceled"),
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 277c14a..65c03b2 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -451,7 +451,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 		 */
 		snprintf(oldpath, MAXPGPATH, "%s.deleted%u",
 				 xlogfpath, deletedcounter++);
-		if (rename(xlogfpath, oldpath) != 0)
+		if (rename_safe(xlogfpath, oldpath) != 0)
 		{
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -470,7 +470,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 		reload = true;
 	}
 
-	if (rename(path, xlogfpath) < 0)
+	if (rename_safe(path, xlogfpath) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -580,7 +580,7 @@ XLogArchiveForceDone(const char *xlog)
 	StatusFilePath(archiveReady, xlog, ".ready");
 	if (stat(archiveReady, &stat_buf) == 0)
 	{
-		if (rename(archiveReady, archiveDone) < 0)
+		if (rename_safe(archiveReady, archiveDone) < 0)
 			ereport(WARNING,
 					(errcode_for_file_access(),
 					 errmsg("could not rename file \"%s\" to \"%s\": %m",
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 397f802..db54889 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -728,7 +728,7 @@ pgarch_archiveDone(char *xlog)
 
 	StatusFilePath(rlogready, xlog, ".ready");
 	StatusFilePath(rlogdone, xlog, ".done");
-	if (rename(rlogready, rlogdone) < 0)
+	if (rename_safe(rlogready, rlogdone) < 0)
 		ereport(WARNING,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 0caf7a3..96368c7 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -617,16 +617,13 @@ CheckPointReplicationOrigin(void)
 	CloseTransientFile(tmpfd);
 
 	/* rename to permanent file, fsync file and directory */
-	if (rename(tmppath, path) != 0)
+	if (rename_safe(tmppath, path) != 0)
 	{
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						tmppath, path)));
 	}
-
-	fsync_fname((char *) path, false);
-	fsync_fname("pg_logical", true);
 }
 
 /*
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index affa9b9..ead221d 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1095,7 +1095,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	START_CRIT_SECTION();
 
 	fsync_fname(path, false);
-	fsync_fname((char *) dir, true);
+	fsync_fname(dir, true);
 	fsync_fname("pg_replslot", true);
 
 	END_CRIT_SECTION();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1b30100..37c8926 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -307,6 +307,7 @@ static void walkdir(const char *path,
 static void pre_sync_fname(const char *fname, bool isdir, int elevel);
 #endif
 static void fsync_fname_ext(const char *fname, bool isdir, int elevel);
+static void fsync_parent_path(const char *fname);
 
 
 /*
@@ -413,7 +414,7 @@ pg_flush_data(int fd, off_t offset, off_t amount)
  * indicate the OS just doesn't allow/require fsyncing directories.
  */
 void
-fsync_fname(char *fname, bool isdir)
+fsync_fname(const char *fname, bool isdir)
 {
 	int			fd;
 	int			returncode;
@@ -424,11 +425,11 @@ fsync_fname(char *fname, bool isdir)
 	 * cases here
 	 */
 	if (!isdir)
-		fd = OpenTransientFile(fname,
+		fd = OpenTransientFile((char *) fname,
 							   O_RDWR | PG_BINARY,
 							   S_IRUSR | S_IWUSR);
 	else
-		fd = OpenTransientFile(fname,
+		fd = OpenTransientFile((char *) fname,
 							   O_RDONLY | PG_BINARY,
 							   S_IRUSR | S_IWUSR);
 
@@ -461,6 +462,75 @@ fsync_fname(char *fname, bool isdir)
 	CloseTransientFile(fd);
 }
 
+/*
+ * rename_safe -- rename of a file, making it on-disk persistent
+ *
+ * This routine ensures that a rename file persists in case of a crash by using
+ * fsync on the old and new files before and after performing the rename so as
+ * this categorizes as an all-or-nothing operation.
+ */
+int
+rename_safe(const char *oldfile, const char *newfile)
+{
+	struct stat	filestats;
+
+	/*
+	 * First fsync the old entry and new entry, if this one exists, to ensure
+	 * that they are properly persistent on disk. Calling this routine with
+	 * an existing new target file is fine, rename() will first remove it
+	 * before performing its operation.
+	 */
+	fsync_fname(oldfile, false);
+	if (stat(newfile, &filestats) == 0)
+		fsync_fname(newfile, false);
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+		return -1;
+
+	/*
+	 * Make change persistent in case of an OS crash, both the new entry and
+	 * its parent directory need to be flushed.
+	 */
+	fsync_fname(newfile, false);
+
+	/*
+	 * Same for parent directory. This routine is never called to rename
+	 * files across directories, but if this proves to become the case,
+	 * flushing the parent directory of the old file would be necessary.
+	 */
+	fsync_parent_path(newfile);
+	return 0;
+}
+
+/*
+ * link_safe -- make a file hard link, making it on-disk persistent
+ *
+ * This routine ensures that a hard link created on a file persists on system
+ * in case of a crash by using fsync on the link generated as well as on
+ * its parent directory.
+ */
+int
+link_safe(const char *oldfile, const char *newfile)
+{
+	if (link(oldfile, newfile) < 0)
+		return -1;
+
+	/*
+	 * Make the link persistent in case of an OS crash, the new entry
+	 * generated as well as its parent directory need to be flushed.
+	 */
+	fsync_fname(newfile, false);
+
+	/*
+	 * Same for parent directory. This routine is never called to rename
+	 * files across directories, but if this proves to become the case,
+	 * flushing the parent directory of the old file would be necessary.
+	 */
+	fsync_parent_path(newfile);
+	return 0;
+}
+
 
 /*
  * InitFileAccess --- initialize this module during backend startup
@@ -2719,3 +2789,29 @@ fsync_fname_ext(const char *fname, bool isdir, int elevel)
 
 	(void) CloseTransientFile(fd);
 }
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+static void
+fsync_parent_path(const char *fname)
+{
+	char	parentpath[MAXPGPATH];
+
+	/* Same for parent directory */
+	snprintf(parentpath, MAXPGPATH, "%s", fname);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument
+	 * is just a file name (see comments in path.c), so handle that as being
+	 * the current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		fsync_fname(".", true);
+	else
+		fsync_fname(parentpath, true);
+}
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 4a3fccb..7f3115a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -113,7 +113,9 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern int	pg_flush_data(int fd, off_t offset, off_t amount);
-extern void fsync_fname(char *fname, bool isdir);
+extern void fsync_fname(const char *fname, bool isdir);
+extern int	rename_safe(const char *oldfile, const char *newfile);
+extern int	link_safe(const char *oldfile, const char *newfile);
 extern void SyncDataDirectory(void);
 
 /* Filename components for OpenTemporaryFile */
#51Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#50)
Re: silent data loss with ext4 / all current versions

Hi,

+/*
+ * rename_safe -- rename of a file, making it on-disk persistent
+ *
+ * This routine ensures that a rename file persists in case of a crash by using
+ * fsync on the old and new files before and after performing the rename so as
+ * this categorizes as an all-or-nothing operation.
+ */
+int
+rename_safe(const char *oldfile, const char *newfile)
+{
+	struct stat	filestats;
+
+	/*
+	 * First fsync the old entry and new entry, if this one exists, to ensure
+	 * that they are properly persistent on disk. Calling this routine with
+	 * an existing new target file is fine, rename() will first remove it
+	 * before performing its operation.
+	 */
+	fsync_fname(oldfile, false);
+	if (stat(newfile, &filestats) == 0)
+		fsync_fname(newfile, false);

I don't think we want any stat()s here. I'd much, much rather check open
for ENOENT.

+/*
+ * link_safe -- make a file hard link, making it on-disk persistent
+ *
+ * This routine ensures that a hard link created on a file persists on system
+ * in case of a crash by using fsync on the link generated as well as on
+ * its parent directory.
+ */
+int
+link_safe(const char *oldfile, const char *newfile)
+{

If we go for a new abstraction here, I'd rather make it
'replace_file_safe' or something, and move the link/rename code #ifdef
into it.

+	if (link(oldfile, newfile) < 0)
+		return -1;
+
+	/*
+	 * Make the link persistent in case of an OS crash, the new entry
+	 * generated as well as its parent directory need to be flushed.
+	 */
+	fsync_fname(newfile, false);
+
+	/*
+	 * Same for parent directory. This routine is never called to rename
+	 * files across directories, but if this proves to become the case,
+	 * flushing the parent directory of the old file would be necessary.
+	 */
+	fsync_parent_path(newfile);
+	return 0;

I think it's a seriously bad idea to encode that knowledge in such a
general sounding routine. We could however argue that this is about
safely replacing the *target* file; not about safely removing the old
file.

Currently I'm inclined to apply this to master soon. But I think we
might want to wait a while with backpatching. The recent fsync upgrade
disaster kinda makes me a bit careful...

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#52Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#51)
Re: silent data loss with ext4 / all current versions

On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote:

Hi,

Thanks for the review.

+/*
+ * rename_safe -- rename of a file, making it on-disk persistent
+ *
+ * This routine ensures that a rename file persists in case of a crash by using
+ * fsync on the old and new files before and after performing the rename so as
+ * this categorizes as an all-or-nothing operation.
+ */
+int
+rename_safe(const char *oldfile, const char *newfile)
+{
+     struct stat     filestats;
+
+     /*
+      * First fsync the old entry and new entry, if this one exists, to ensure
+      * that they are properly persistent on disk. Calling this routine with
+      * an existing new target file is fine, rename() will first remove it
+      * before performing its operation.
+      */
+     fsync_fname(oldfile, false);
+     if (stat(newfile, &filestats) == 0)
+             fsync_fname(newfile, false);

I don't think we want any stat()s here. I'd much, much rather check open
for ENOENT.

OK. So you mean more or less that, right?
int fd;
fd = OpenTransientFile(newfile, PG_BINARY | O_RDONLY, 0);
if (fd < 0)
{
    if (errno != ENOENT)
        return -1;
}
else
{
    pg_fsync(fd);
    CloseTransientFile(fd);
}

+/*
+ * link_safe -- make a file hard link, making it on-disk persistent
+ *
+ * This routine ensures that a hard link created on a file persists on system
+ * in case of a crash by using fsync on the link generated as well as on
+ * its parent directory.
+ */
+int
+link_safe(const char *oldfile, const char *newfile)
+{

If we go for a new abstraction here, I'd rather make it
'replace_file_safe' or something, and move the link/rename code #ifdef
into it.

Hm. OK. Thinking about it, I don't see any reason why switching to
link() would be a problem, even in code paths like
KeepFileRestoredFromArchive() or pgarch_archiveDone(). If
HAVE_WORKING_LINK is available on a platform, we can combine it with
unlink(). Is that in line with what you think?

+     if (link(oldfile, newfile) < 0)
+             return -1;
+
+     /*
+      * Make the link persistent in case of an OS crash, the new entry
+      * generated as well as its parent directory need to be flushed.
+      */
+     fsync_fname(newfile, false);
+
+     /*
+      * Same for parent directory. This routine is never called to rename
+      * files across directories, but if this proves to become the case,
+      * flushing the parent directory of the old file would be necessary.
+      */
+     fsync_parent_path(newfile);
+     return 0;

I think it's a seriously bad idea to encode that knowledge in such a
general sounding routine. We could however argue that this is about
safely replacing the *target* file; not about safely removing the old
file.

Not sure I am following here. Do you mean that this routine becomes
unreliable if the new file and the old file are in different
directories? That is indeed the case if we want to make both of them
persistent, and I think we do. Are you suggesting that we correct
this comment and drop the mention of the old file's parent directory,
because we only care about the new file being persistent? Or are you
suggesting that we extend this routine so that it fsyncs the parent
directories of both the new and the old file when they differ?

Currently I'm inclined to apply this to master soon. But I think we
might want to wait a while with backpatching. The recent fsync upgrade
disaster kinda makes me a bit careful...

There have not been any actual field complaints about that yet, so I'm
fine with waiting a bit.
--
Michael

#53Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#52)
Re: silent data loss with ext4 / all current versions

I would like to have a patch for this finalized today, so that we can
apply to master before or during the weekend; with it in the tree for
about a week we can be more confident and backpatch close to next
weekend, so that we see it in the next set of minor releases. Does that
sound good?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#53)
Re: silent data loss with ext4 / all current versions

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I would like to have a patch for this finalized today, so that we can
apply to master before or during the weekend; with it in the tree for
about a week we can be more confident and backpatch close to next
weekend, so that we see it in the next set of minor releases. Does that
sound good?

I see no reason to wait before backpatching. If you're concerned about
having testing, the more branches it is in, the more buildfarm cycles
you will get on it. And we're not going to cut any releases in between,
so what's the benefit of not having it there?

regards, tom lane

#55Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#54)
Re: silent data loss with ext4 / all current versions

On Fri, Mar 4, 2016 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I would like to have a patch for this finalized today, so that we can
apply to master before or during the weekend; with it in the tree for
about a week we can be more confident and backpatch close to next
weekend, so that we see it in the next set of minor releases. Does that
sound good?

I see no reason to wait before backpatching. If you're concerned about
having testing, the more branches it is in, the more buildfarm cycles
you will get on it. And we're not going to cut any releases in between,
so what's the benefit of not having it there?

Agreed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#56Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#55)
Re: silent data loss with ext4 / all current versions

On Sat, Mar 5, 2016 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 4, 2016 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I would like to have a patch for this finalized today, so that we can
apply to master before or during the weekend; with it in the tree for
about a week we can be more confident and backpatch close to next
weekend, so that we see it in the next set of minor releases. Does that
sound good?

I see no reason to wait before backpatching. If you're concerned about
having testing, the more branches it is in, the more buildfarm cycles
you will get on it. And we're not going to cut any releases in between,
so what's the benefit of not having it there?

Agreed.

OK. I could produce that by tonight my time, but unfortunately not before.
And FWIW, regarding Andres's comments, it is not clear to me what we
gain by having a common routine for link() and rename(), knowing that
half the code paths performing a rename do not rely on link(). At
least it sounds dangerous to me to introduce a dependency on link() in
code paths that depend only on rename() in the back branches. On HEAD,
we could be more adventurous for sure. I do agree with replacing
stat() by something relying on OpenTransientFile, though. As for
the flush of the parent directory in link_safe(), we'd still want to do
it, and it is fine not to flush the parent directory of the old file
because the backend does not move files across directories.
--
Michael

#57Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#52)
Re: silent data loss with ext4 / all current versions

On 2016-03-04 14:51:50 +0900, Michael Paquier wrote:

On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote:

Hi,

Thanks for the review.

+/*
+ * rename_safe -- rename of a file, making it on-disk persistent
+ *
+ * This routine ensures that a rename file persists in case of a crash by using
+ * fsync on the old and new files before and after performing the rename so as
+ * this categorizes as an all-or-nothing operation.
+ */
+int
+rename_safe(const char *oldfile, const char *newfile)
+{
+     struct stat     filestats;
+
+     /*
+      * First fsync the old entry and new entry, if this one exists, to ensure
+      * that they are properly persistent on disk. Calling this routine with
+      * an existing new target file is fine, rename() will first remove it
+      * before performing its operation.
+      */
+     fsync_fname(oldfile, false);
+     if (stat(newfile, &filestats) == 0)
+             fsync_fname(newfile, false);

I don't think we want any stat()s here. I'd much, much rather check open
for ENOENT.

OK. So you mean more or less that, right?
int fd;
fd = OpenTransientFile(newfile, PG_BINARY | O_RDONLY, 0);
if (fd < 0)
{
    if (errno != ENOENT)
        return -1;
}
else
{
    pg_fsync(fd);
    CloseTransientFile(fd);
}

Yes. Otherwise the check is racy: The file could be gone by the time you
do the fsync; leading to a spurious ERROR (which often would get
promoted to a PANIC).
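
To illustrate the race described above, here is a minimal standalone sketch in plain POSIX C. It is not the patch's code: the name fsync_if_exists is hypothetical, and it uses open()/fsync()/close() directly where the backend would use OpenTransientFile(), pg_fsync() and CloseTransientFile(). The idea is that existence is decided by the open() call itself, so there is no gap between "check" and "use" as there is with a separate stat().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * fsync_if_exists (hypothetical name) -- flush "path" if it exists.
 *
 * Returns 0 if the file was flushed or is absent, -1 (errno set) otherwise.
 * Existence is decided by open() itself: once we hold a descriptor, the file
 * may be unlinked by someone else and fsync() on the descriptor still works,
 * which is not true of a stat() followed by a second lookup by name.
 */
static int
fsync_if_exists(const char *path)
{
	int			fd;

	assert(path != NULL);

	/* O_RDWR mirrors what fsync_fname() in the patch uses for plain files */
	fd = open(path, O_RDWR);
	if (fd < 0)
		return (errno == ENOENT) ? 0 : -1;	/* a missing file is not an error */

	if (fsync(fd) != 0)
	{
		int			save_errno = errno;

		close(fd);
		errno = save_errno;		/* report the fsync failure, not close's */
		return -1;
	}

	return close(fd);
}
```

A caller that wants the patch's behaviour would call this on the rename target before rename(), and treat only a -1 return as a failure.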

+/*
+ * link_safe -- make a file hard link, making it on-disk persistent
+ *
+ * This routine ensures that a hard link created on a file persists on system
+ * in case of a crash by using fsync on the link generated as well as on
+ * its parent directory.
+ */
+int
+link_safe(const char *oldfile, const char *newfile)
+{

If we go for a new abstraction here, I'd rather make it
'replace_file_safe' or something, and move the link/rename code #ifdef
into it.

Hm. OK. I don't see any reason why switching to link() even in code
paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would
be a problem thinking about it. Should HAVE_WORKING_LINK be available
on a platform we can combine it with unlink. Is that in line with what
you think?

I wasn't trying to suggest we should replace all rename codepaths with
the link wrapper, just the ones that already have a HAVE_WORKING_LINK
check. The name of the routine I suggested is bad though...

+     if (link(oldfile, newfile) < 0)
+             return -1;
+
+     /*
+      * Make the link persistent in case of an OS crash, the new entry
+      * generated as well as its parent directory need to be flushed.
+      */
+     fsync_fname(newfile, false);
+
+     /*
+      * Same for parent directory. This routine is never called to rename
+      * files across directories, but if this proves to become the case,
+      * flushing the parent directory of the old file would be necessary.
+      */
+     fsync_parent_path(newfile);
+     return 0;

I think it's a seriously bad idea to encode that knowledge in such a
general sounding routine. We could however argue that this is about
safely replacing the *target* file; not about safely removing the old
file.

Not sure I am following here. Are you referring to the fact that if
the new file and old file are on different directories would make this
routine unreliable?

Yes.

Because yes that's the case if we want to make both of them
persistent, and I think we want to do so.

That's one way.

Do you suggest to correct this comment to remove the mention to the
old file's parent directory because we just care about having the new
file as being persistent?

That's one approach, yes. Combined with the fact that you can't actually
reliably rename across directories, the two could be on different
filesystems after all, that'd be a suitable defense. It just needs to be
properly documented in the function header, not at the bottom.

Regards,

Andres

#58Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#56)
Re: silent data loss with ext4 / all current versions

On 2016-03-05 07:29:35 +0900, Michael Paquier wrote:

OK. I could produce that by tonight my time, not before unfortunately.

I'm switching to this patch, after pushing the pending logical decoding
fixes. Probably not today, but tomorrow PST afternoon should work.

And FWIW, per the comments of Andres, it is not clear to me what we
gain by having a common routine for link() and rename() knowing that
half the code paths performing a rename do not rely on link().

I'm not talking about replacing all renames with this. Just the ones
that currently use link(). There's not much point in introducing
link_safe(), when all the callers have the same duplicated code, with a
fallback to rename().

Andres

#59Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#57)
Re: silent data loss with ext4 / all current versions

On Sat, Mar 5, 2016 at 7:35 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-04 14:51:50 +0900, Michael Paquier wrote:

On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote:

I don't think we want any stat()s here. I'd much, much rather check open
for ENOENT.

OK. So you mean more or less that, right?
int fd;
fd = OpenTransientFile(newfile, PG_BINARY | O_RDONLY, 0);
if (fd < 0)
{
    if (errno != ENOENT)
        return -1;
}
else
{
    pg_fsync(fd);
    CloseTransientFile(fd);
}

Yes. Otherwise the check is racy: The file could be gone by the time you
do the fsync; leading to a spurious ERROR (which often would get
promoted to a PANIC).

Yeah, that makes sense.

+/*
+ * link_safe -- make a file hard link, making it on-disk persistent
+ *
+ * This routine ensures that a hard link created on a file persists on system
+ * in case of a crash by using fsync on the link generated as well as on
+ * its parent directory.
+ */
+int
+link_safe(const char *oldfile, const char *newfile)
+{

If we go for a new abstraction here, I'd rather make it
'replace_file_safe' or something, and move the link/rename code #ifdef
into it.

Hm. OK. I don't see any reason why switching to link() even in code
paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would
be a problem thinking about it. Should HAVE_WORKING_LINK be available
on a platform we can combine it with unlink. Is that in line with what
you think?

I wasn't trying to suggest we should replace all rename codepaths with
the link wrapper, just the ones that already have a HAVE_WORKING_LINK
check. The name of the routine I suggested is bad though...

So we'd introduce a first routine rename_or_link_safe(), say replace_safe().

Do you suggest to correct this comment to remove the mention to the
old file's parent directory because we just care about having the new
file as being persistent?

That's one approach, yes. Combined with the fact that you can't actually
reliably rename across directories, the two could be on different
filesystems after all, that'd be a suitable defense. It just needs to be
properly documented in the function header, not at the bottom.

OK. Got it. Or the two could be on the same filesystem. Still, link()
and rename() do not support doing their stuff on different filesystems
(EXDEV).
--
Michael

#60Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#58)
Re: silent data loss with ext4 / all current versions

On Sat, Mar 5, 2016 at 7:37 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-05 07:29:35 +0900, Michael Paquier wrote:

OK. I could produce that by tonight my time, not before unfortunately.

I'm switching to this patch, after pushing the pending logical decoding
fixes. Probably not today, but tomorrow PST afternoon should work.

OK, so if that's the case, there is no need for me to step on your toes.

And FWIW, per the comments of Andres, it is not clear to me what we
gain by having a common routine for link() and rename() knowing that
half the code paths performing a rename do not rely on link().

I'm not talking about replacing all renames with this. Just the ones
that currently use link(). There's not much point in introducing
link_safe(), when all the callers have the same duplicated code, with a
fallback to rename().

Indeed, that's the case. I don't have a better name than replace_safe
though. replace_paranoid is not a very appealing name either.
--
Michael

#61Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#59)
Re: silent data loss with ext4 / all current versions

On 2016-03-05 07:43:00 +0900, Michael Paquier wrote:

On Sat, Mar 5, 2016 at 7:35 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-04 14:51:50 +0900, Michael Paquier wrote:

On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote:
Hm. OK. I don't see any reason why switching to link() even in code
paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would
be a problem thinking about it. Should HAVE_WORKING_LINK be available
on a platform we can combine it with unlink. Is that in line with what
you think?

I wasn't trying to suggest we should replace all rename codepaths with
the link wrapper, just the ones that already have a HAVE_WORKING_LINK
check. The name of the routine I suggested is bad though...

So we'd introduce a first routine rename_or_link_safe(), say replace_safe().

Or actually maybe just link_safe(), which falls back to access() &&
rename() if !HAVE_WORKING_LINK.
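
A rough standalone sketch of that shape, in plain POSIX C, might look as follows. The function name install_file_no_overwrite is hypothetical, and defining HAVE_WORKING_LINK to 1 here is an assumption made only so the sketch compiles on its own. The durability steps (fsync of the new entry and of its parent directory) are deliberately left out to keep the focus on folding the duplicated #if/#else blocks into one place.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Assumption for this sketch: use the link() path unless told otherwise */
#ifndef HAVE_WORKING_LINK
#define HAVE_WORKING_LINK 1
#endif

/*
 * install_file_no_overwrite (hypothetical) -- move tmppath to path without
 * overwriting an existing file. Returns 0 on success, -1 with errno set.
 */
static int
install_file_no_overwrite(const char *tmppath, const char *path)
{
	assert(tmppath != NULL && path != NULL);

#if HAVE_WORKING_LINK
	/* link() fails with EEXIST if path already exists, atomically */
	if (link(tmppath, path) < 0)
		return -1;
	unlink(tmppath);
	return 0;
#else
	/*
	 * Fallback: check, then rename. This is not atomic; a file created
	 * between access() and rename() would still be overwritten.
	 */
	if (access(path, F_OK) == 0)
	{
		errno = EEXIST;
		return -1;
	}
	return rename(tmppath, path);
#endif
}
```

With this in place, each call site that currently carries its own HAVE_WORKING_LINK block reduces to a single call plus error reporting.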

That's one approach, yes. Combined with the fact that you can't actually
reliably rename across directories, the two could be on different
filesystems after all, that'd be a suitable defense. It just needs to be
properly documented in the function header, not at the bottom.

OK. Got it. Or the two could be on the same filesystem.

Still, link() and rename() do not support doing their stuff on
different filesystems (EXDEV).

That's my point ...

#62Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#61)
1 attachment(s)
Re: silent data loss with ext4 / all current versions

On Sat, Mar 5, 2016 at 7:47 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-05 07:43:00 +0900, Michael Paquier wrote:

On Sat, Mar 5, 2016 at 7:35 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-04 14:51:50 +0900, Michael Paquier wrote:

On Fri, Mar 4, 2016 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote:
Hm. OK. I don't see any reason why switching to link() even in code
paths like KeepFileRestoredFromArchive() or pgarch_archiveDone() would
be a problem thinking about it. Should HAVE_WORKING_LINK be available
on a platform we can combine it with unlink. Is that in line with what
you think?

I wasn't trying to suggest we should replace all rename codepaths with
the link wrapper, just the ones that already have a HAVE_WORKING_LINK
check. The name of the routine I suggested is bad though...

So we'd introduce a first routine rename_or_link_safe(), say replace_safe().

Or actually maybe just link_safe(), which falls back to access() &&
rename() if !HAVE_WORKING_LINK.

That's one approach, yes. Combined with the fact that you can't actually
reliably rename across directories, the two could be on different
filesystems after all, that'd be a suitable defense. It just needs to be
properly documented in the function header, not at the bottom.

OK. Got it. Or the two could be on the same filesystem.

Still, link() and rename() do not support doing their stuff on
different filesystems (EXDEV).

That's my point ...

OK, I hacked a v7:
- Move the link()/rename() group with HAVE_WORKING_LINK into a single
routine, renaming the previous link_safe to replace_safe. It
shares a lot with rename_safe. I am not sure it is worth
complicating the code further by merging both into one common
routine. Thoughts welcome. Honestly, I kind of liked the separation
between link_safe and rename_safe in the previous patches, because
link_safe could have been used directly by extensions and plugins.
- Remove the call to stat() in rename_safe() and implement logic
based on OpenTransientFile()/pg_fsync() to flush any existing
target file before performing the rename.
Andres, feel free to use this patch as a base, perhaps that will help.
--
Michael

Attachments:

xlog-fsync-v7.patch (application/x-patch)
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index f6da673..80ec293 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -417,25 +417,12 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 */
 	TLHistoryFilePath(path, newTLI);
 
-	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing file.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
-	 */
-#if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\": %m",
-						tmppath, path)));
-	unlink(tmppath);
-#else
-	if (rename(tmppath, path) < 0)
+	/* And perform the rename */
+	if (replace_safe(tmppath, path) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
+				 errmsg("could not replace file \"%s\" to \"%s\": %m",
 						tmppath, path)));
-#endif
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -507,25 +494,12 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	 */
 	TLHistoryFilePath(path, tli);
 
-	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
-	 */
-#if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\": %m",
-						tmppath, path)));
-	unlink(tmppath);
-#else
-	if (rename(tmppath, path) < 0)
+	/* And perform the rename */
+	if (replace_safe(tmppath, path) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
+				 errmsg("could not replace file \"%s\" to \"%s\": %m",
 						tmppath, path)));
-#endif
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 94b79ac..6c4f36d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3298,35 +3298,17 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 		}
 	}
 
-	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
-	 */
-#if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
+	/* rename the segment file */
+	if (replace_safe(tmppath, path) < 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
 		ereport(LOG,
 				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\" (initialization of log file): %m",
+				 errmsg("could not replace file \"%s\" to \"%s\" (initialization of log file): %m",
 						tmppath, path)));
 		return false;
 	}
-	unlink(tmppath);
-#else
-	if (rename(tmppath, path) < 0)
-	{
-		if (use_lock)
-			LWLockRelease(ControlFileLock);
-		ereport(LOG,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\" (initialization of log file): %m",
-						tmppath, path)));
-		return false;
-	}
-#endif
 
 	if (use_lock)
 		LWLockRelease(ControlFileLock);
@@ -3840,7 +3822,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 		 * flag, rename will fail. We'll try again at the next checkpoint.
 		 */
 		snprintf(newpath, MAXPGPATH, "%s.deleted", path);
-		if (rename(path, newpath) != 0)
+		if (rename_safe(path, newpath) != 0)
 		{
 			ereport(LOG,
 					(errcode_for_file_access(),
@@ -5339,7 +5321,7 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * re-enter archive recovery mode in a subsequent crash.
 	 */
 	unlink(RECOVERY_COMMAND_DONE);
-	if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
+	if (rename_safe(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
 		ereport(FATAL,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -6186,7 +6168,7 @@ StartupXLOG(void)
 		if (stat(TABLESPACE_MAP, &st) == 0)
 		{
 			unlink(TABLESPACE_MAP_OLD);
-			if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
+			if (rename_safe(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
 				ereport(LOG,
 					(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
 							TABLESPACE_MAP, BACKUP_LABEL_FILE),
@@ -6549,7 +6531,7 @@ StartupXLOG(void)
 		if (haveBackupLabel)
 		{
 			unlink(BACKUP_LABEL_OLD);
-			if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
+			if (rename_safe(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
 				ereport(FATAL,
 						(errcode_for_file_access(),
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -6566,7 +6548,7 @@ StartupXLOG(void)
 		if (haveTblspcMap)
 		{
 			unlink(TABLESPACE_MAP_OLD);
-			if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) != 0)
+			if (rename_safe(TABLESPACE_MAP, TABLESPACE_MAP_OLD) != 0)
 				ereport(FATAL,
 						(errcode_for_file_access(),
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -7347,7 +7329,7 @@ StartupXLOG(void)
 				 */
 				XLogArchiveCleanup(partialfname);
 
-				if (rename(origpath, partialpath) != 0)
+				if (rename_safe(origpath, partialpath) != 0)
 					ereport(ERROR,
 							(errcode_for_file_access(),
 						 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -10907,7 +10889,7 @@ CancelBackup(void)
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(BACKUP_LABEL_OLD);
 
-	if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
+	if (rename_safe(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
 	{
 		ereport(WARNING,
 				(errcode_for_file_access(),
@@ -10930,7 +10912,7 @@ CancelBackup(void)
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(TABLESPACE_MAP_OLD);
 
-	if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
+	if (rename_safe(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
 	{
 		ereport(LOG,
 				(errmsg("online backup mode canceled"),
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 277c14a..65c03b2 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -451,7 +451,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 		 */
 		snprintf(oldpath, MAXPGPATH, "%s.deleted%u",
 				 xlogfpath, deletedcounter++);
-		if (rename(xlogfpath, oldpath) != 0)
+		if (rename_safe(xlogfpath, oldpath) != 0)
 		{
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -470,7 +470,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 		reload = true;
 	}
 
-	if (rename(path, xlogfpath) < 0)
+	if (rename_safe(path, xlogfpath) < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
@@ -580,7 +580,7 @@ XLogArchiveForceDone(const char *xlog)
 	StatusFilePath(archiveReady, xlog, ".ready");
 	if (stat(archiveReady, &stat_buf) == 0)
 	{
-		if (rename(archiveReady, archiveDone) < 0)
+		if (rename_safe(archiveReady, archiveDone) < 0)
 			ereport(WARNING,
 					(errcode_for_file_access(),
 					 errmsg("could not rename file \"%s\" to \"%s\": %m",
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 397f802..db54889 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -728,7 +728,7 @@ pgarch_archiveDone(char *xlog)
 
 	StatusFilePath(rlogready, xlog, ".ready");
 	StatusFilePath(rlogdone, xlog, ".done");
-	if (rename(rlogready, rlogdone) < 0)
+	if (rename_safe(rlogready, rlogdone) < 0)
 		ereport(WARNING,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 0caf7a3..96368c7 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -617,16 +617,13 @@ CheckPointReplicationOrigin(void)
 	CloseTransientFile(tmpfd);
 
 	/* rename to permanent file, fsync file and directory */
-	if (rename(tmppath, path) != 0)
+	if (rename_safe(tmppath, path) != 0)
 	{
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						tmppath, path)));
 	}
-
-	fsync_fname((char *) path, false);
-	fsync_fname("pg_logical", true);
 }
 
 /*
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index affa9b9..ead221d 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1095,7 +1095,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	START_CRIT_SECTION();
 
 	fsync_fname(path, false);
-	fsync_fname((char *) dir, true);
+	fsync_fname(dir, true);
 	fsync_fname("pg_replslot", true);
 
 	END_CRIT_SECTION();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1b30100..85bab37 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -307,6 +307,7 @@ static void walkdir(const char *path,
 static void pre_sync_fname(const char *fname, bool isdir, int elevel);
 #endif
 static void fsync_fname_ext(const char *fname, bool isdir, int elevel);
+static void fsync_parent_path(const char *fname);
 
 
 /*
@@ -413,7 +414,7 @@ pg_flush_data(int fd, off_t offset, off_t amount)
  * indicate the OS just doesn't allow/require fsyncing directories.
  */
 void
-fsync_fname(char *fname, bool isdir)
+fsync_fname(const char *fname, bool isdir)
 {
 	int			fd;
 	int			returncode;
@@ -424,11 +425,11 @@ fsync_fname(char *fname, bool isdir)
 	 * cases here
 	 */
 	if (!isdir)
-		fd = OpenTransientFile(fname,
+		fd = OpenTransientFile((char *) fname,
 							   O_RDWR | PG_BINARY,
 							   S_IRUSR | S_IWUSR);
 	else
-		fd = OpenTransientFile(fname,
+		fd = OpenTransientFile((char *) fname,
 							   O_RDONLY | PG_BINARY,
 							   S_IRUSR | S_IWUSR);
 
@@ -461,6 +462,100 @@ fsync_fname(char *fname, bool isdir)
 	CloseTransientFile(fd);
 }
 
+/*
+ * rename_safe -- rename of a file, making it on-disk persistent
+ *
+ * This routine ensures that a renamed file persists in case of a crash by
+ * using fsync on the old and new files before and after performing the
+ * rename, so that it qualifies as an all-or-nothing operation.
+ *
+ * rename() is not reliable across directories, particularly if the origin
+ * point and the target point are located on different mounted partitions
+ * so this routine should be called when the replacement of a file is
+ * located in the same directory as its origin file.
+ */
+int
+rename_safe(const char *oldfile, const char *newfile)
+{
+	int		fd;
+
+	/*
+	 * First fsync the old entry and new entry, if this one exists, to ensure
+	 * that they are properly persistent on disk. Calling this routine with
+	 * an existing new target file is fine, rename() will first remove it
+	 * before performing its operation.
+	 */
+	fsync_fname(oldfile, false);
+
+	fd = OpenTransientFile((char *) newfile, PG_BINARY | O_RDONLY, 0);
+	if (fd < 0)
+	{
+		if (errno != ENOENT)
+			 return -1;
+	}
+	else
+	{
+		if (pg_fsync(fd) != 0)
+			ereport(LOG,
+					(errcode_for_file_access(),
+					 errmsg("could not write to file \"%s\": %m",
+							newfile)));
+		(void) CloseTransientFile(fd);
+	}
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+		return -1;
+
+	/*
+	 * Make change persistent in case of an OS crash, both the new entry and
+	 * its parent directory need to be flushed.
+	 */
+	fsync_fname(newfile, false);
+
+	/* Same for parent directory */
+	fsync_parent_path(newfile);
+	return 0;
+}
+
+/*
+ * replace_safe -- replace a file, making it on-disk persistent
+ *
+ * This routine ensures that a link or rename of a file persists on disk
+ * in case of a crash by using fsync on the newly created link as well
+ * as on its parent directory. link() is preferred to rename() just
+ * to be really sure that an existing file is not overwritten. However,
+ * there should not be an existing file when calling this routine, so rename()
+ * is an acceptable substitute except for the truly paranoid.
+ *
+ * rename() and link() are not reliable across directories, particularly
+ * if the origin point and the target point are located on different mounted
+ * partitions, so this routine should be called when the replacement of a
+ * file is located in the same directory as its origin file.
+ */
+int
+replace_safe(const char *oldfile, const char *newfile)
+{
+#if HAVE_WORKING_LINK
+	if (link(oldfile, newfile) < 0)
+		return -1;
+	unlink(oldfile);
+#else
+	if (rename(oldfile, newfile) < 0)
+		return -1;
+#endif
+
+	/*
+	 * Make change persistent in case of an OS crash, both the new entry and
+	 * its parent directory need to be flushed.
+	 */
+	fsync_fname(newfile, false);
+
+	/* Same for parent directory */
+	fsync_parent_path(newfile);
+	return 0;
+}
+
 
 /*
  * InitFileAccess --- initialize this module during backend startup
@@ -2719,3 +2814,29 @@ fsync_fname_ext(const char *fname, bool isdir, int elevel)
 
 	(void) CloseTransientFile(fd);
 }
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+static void
+fsync_parent_path(const char *fname)
+{
+	char	parentpath[MAXPGPATH];
+
+	/* Same for parent directory */
+	snprintf(parentpath, MAXPGPATH, "%s", fname);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument
+	 * is just a file name (see comments in path.c), so handle that as being
+	 * the current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		fsync_fname(".", true);
+	else
+		fsync_fname(parentpath, true);
+}
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 4a3fccb..60836a1 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -113,7 +113,9 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern int	pg_flush_data(int fd, off_t offset, off_t amount);
-extern void fsync_fname(char *fname, bool isdir);
+extern void fsync_fname(const char *fname, bool isdir);
+extern int	rename_safe(const char *oldfile, const char *newfile);
+extern int	replace_safe(const char *oldfile, const char *newfile);
 extern void SyncDataDirectory(void);
 
 /* Filename components for OpenTemporaryFile */
#63Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#62)
Re: silent data loss with ext4 / all current versions

On 2016-03-05 22:25:36 +0900, Michael Paquier wrote:

OK, I hacked a v7:
- Move the link()/rename() group with HAVE_WORKING_LINK into a single
routine, renaming the previous link_safe to replace_safe. This
shares a lot with rename_safe. I am not sure it is worth
complicating the code further by having a single common routine
for both. Thoughts welcome. Honestly, I kind of liked the separation
of link_safe/rename_safe in the previous patches, because link_safe could
have been used directly by extensions and plugins, btw.
- Remove the call of stat() in rename_safe() and implement logic
based on OpenTransientFile()/pg_fsync() to flush any existing
target file before performing the rename.
Andres, feel free to use this patch as a base, perhaps that will help.
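
To illustrate the link()/rename() distinction in the first item above: link() fails with EEXIST when the target name already exists, while rename() replaces the target. The following is only a minimal sketch using plain POSIX calls rather than the PostgreSQL wrappers; move_no_overwrite is a hypothetical name and is not part of any patch in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): give oldpath the name newpath
 * without overwriting an existing newpath.  link() refuses to replace an
 * existing target and fails with EEXIST; rename() would replace it.
 * Returns 0 on success, -1 on failure with errno set by link().
 */
int
move_no_overwrite(const char *oldpath, const char *newpath)
{
    assert(oldpath != NULL && newpath != NULL);

    if (link(oldpath, newpath) < 0)
    {
        int save_errno = errno;

        fprintf(stderr, "could not link \"%s\" to \"%s\": %s\n",
                oldpath, newpath, strerror(save_errno));
        errno = save_errno;     /* fprintf() may have clobbered errno */
        return -1;
    }

    /* Both names now refer to the same file; drop the old one. */
    (void) unlink(oldpath);
    return 0;
}
```

Note that this sketch issues no fsync at all, so on its own it is atomic with respect to overwriting but not durable; durability is what the rest of this thread is about.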

I started working on this; delayed by taking longer than planned on the
logical decoding stuff (quite a bit complicated by
e1a11d93111ff3fba7a91f3f2ac0b0aca16909a8). I'm not very happy with the
error handling as it is right now. For one, you have rename_safe return
error codes, and act on them in the callers, but on the other hand you
call fsync_fname which always errors out in case of failure. I also
don't like the new messages much.

Will continue working on this tomorrow.

Andres Freund

#64Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#63)
1 attachment(s)
Re: silent data loss with ext4 / all current versions

Hi,

On 2016-03-05 19:54:05 -0800, Andres Freund wrote:

I started working on this; delayed by taking longer than planned on the
logical decoding stuff (quite a bit complicated by
e1a11d93111ff3fba7a91f3f2ac0b0aca16909a8). I'm not very happy with the
error handling as it is right now. For one, you have rename_safe return
error codes, and act on them in the callers, but on the other hand you
call fsync_fname which always errors out in case of failure. I also
don't like the new messages much.

Will continue working on this tomorrow.

So, here's my current version of this. I've not performed any testing
yet, and it's hot off the press. There's some comment-smithing
needed. But otherwise I'm starting to like this.

Changes:
* renamed rename_safe to durable_rename
* renamed replace_safe to durable_link_or_rename (there was never any
replacing going on)
* pass through elevel to the underlying routines, otherwise we could
error out with ERROR when we don't want to. That's particularly
important in case of things like InstallXLogFileSegment().
* made fsync_fname use fsync_fname_ext, add 'accept permission errors'
param
* have walkdir call a wrapper, to add ignore_perms param

What do you think?

I sure wish we had the recovery testing et al. in all the back
branches...

- Andres

Attachments:

0001-durable-rename-andres-v8.patch (text/x-patch; charset=us-ascii)
From e60caf094f68496658e969cdd4df919fd66e9d29 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 6 Mar 2016 22:20:17 -0800
Subject: [PATCH 1/2] durable-rename-andres-v8

---
 src/backend/access/transam/timeline.c    |  40 +----
 src/backend/access/transam/xlog.c        |  64 ++------
 src/backend/access/transam/xlogarchive.c |  21 +--
 src/backend/postmaster/pgarch.c          |   6 +-
 src/backend/replication/logical/origin.c |  23 +--
 src/backend/replication/slot.c           |   2 +-
 src/backend/storage/file/fd.c            | 267 +++++++++++++++++++++++--------
 src/include/storage/fd.h                 |   4 +-
 8 files changed, 232 insertions(+), 195 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index f6da673..bd91573 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -418,24 +418,10 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	TLHistoryFilePath(path, newTLI);
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing file.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Perform the rename using link if available, paranoidly trying to avoid
+	 * overwriting an existing file (there shouldn't be one).
 	 */
-#if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\": %m",
-						tmppath, path)));
-	unlink(tmppath);
-#else
-	if (rename(tmppath, path) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						tmppath, path)));
-#endif
+	durable_link_or_rename(tmppath, path, ERROR);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -508,24 +494,10 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	TLHistoryFilePath(path, tli);
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Perform the rename using link if available, paranoidly trying to avoid
+	 * overwriting an existing file (there shouldn't be one).
 	 */
-#if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\": %m",
-						tmppath, path)));
-	unlink(tmppath);
-#else
-	if (rename(tmppath, path) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						tmppath, path)));
-#endif
+	durable_link_or_rename(tmppath, path, ERROR);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 00f139a..2d63a54 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3299,34 +3299,16 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Perform the rename using link if available, paranoidly trying to avoid
+	 * overwriting an existing file (there shouldn't be one).
 	 */
-#if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
+	if (durable_link_or_rename(tmppath, path, LOG) != 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
-		ereport(LOG,
-				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\" (initialization of log file): %m",
-						tmppath, path)));
+		/* durable_link_or_rename already emitted log message */
 		return false;
 	}
-	unlink(tmppath);
-#else
-	if (rename(tmppath, path) < 0)
-	{
-		if (use_lock)
-			LWLockRelease(ControlFileLock);
-		ereport(LOG,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\" (initialization of log file): %m",
-						tmppath, path)));
-		return false;
-	}
-#endif
 
 	if (use_lock)
 		LWLockRelease(ControlFileLock);
@@ -3840,14 +3822,8 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 		 * flag, rename will fail. We'll try again at the next checkpoint.
 		 */
 		snprintf(newpath, MAXPGPATH, "%s.deleted", path);
-		if (rename(path, newpath) != 0)
-		{
-			ereport(LOG,
-					(errcode_for_file_access(),
-			   errmsg("could not rename old transaction log file \"%s\": %m",
-					  path)));
+		if (durable_rename(path, newpath, LOG) != 0)
 			return;
-		}
 		rc = unlink(newpath);
 #else
 		rc = unlink(path);
@@ -5339,11 +5315,7 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * re-enter archive recovery mode in a subsequent crash.
 	 */
 	unlink(RECOVERY_COMMAND_DONE);
-	if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
-		ereport(FATAL,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
+	durable_rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE, FATAL);
 
 	ereport(LOG,
 			(errmsg("archive recovery complete")));
@@ -6190,7 +6162,7 @@ StartupXLOG(void)
 		if (stat(TABLESPACE_MAP, &st) == 0)
 		{
 			unlink(TABLESPACE_MAP_OLD);
-			if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
+			if (durable_rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD, DEBUG1) == 0)
 				ereport(LOG,
 					(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
 							TABLESPACE_MAP, BACKUP_LABEL_FILE),
@@ -6553,11 +6525,7 @@ StartupXLOG(void)
 		if (haveBackupLabel)
 		{
 			unlink(BACKUP_LABEL_OLD);
-			if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
-				ereport(FATAL,
-						(errcode_for_file_access(),
-						 errmsg("could not rename file \"%s\" to \"%s\": %m",
-								BACKUP_LABEL_FILE, BACKUP_LABEL_OLD)));
+			durable_rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD, FATAL);
 		}
 
 		/*
@@ -6570,11 +6538,7 @@ StartupXLOG(void)
 		if (haveTblspcMap)
 		{
 			unlink(TABLESPACE_MAP_OLD);
-			if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) != 0)
-				ereport(FATAL,
-						(errcode_for_file_access(),
-						 errmsg("could not rename file \"%s\" to \"%s\": %m",
-								TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
+			durable_rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD, FATAL);
 		}
 
 		/* Check that the GUCs used to generate the WAL allow recovery */
@@ -7351,11 +7315,7 @@ StartupXLOG(void)
 				 */
 				XLogArchiveCleanup(partialfname);
 
-				if (rename(origpath, partialpath) != 0)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-						 errmsg("could not rename file \"%s\" to \"%s\": %m",
-								origpath, partialpath)));
+				durable_rename(origpath, partialpath, ERROR);
 				XLogArchiveNotify(partialfname);
 			}
 		}
@@ -10911,7 +10871,7 @@ CancelBackup(void)
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(BACKUP_LABEL_OLD);
 
-	if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
+	if (durable_rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD, DEBUG1) != 0)
 	{
 		ereport(WARNING,
 				(errcode_for_file_access(),
@@ -10934,7 +10894,7 @@ CancelBackup(void)
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(TABLESPACE_MAP_OLD);
 
-	if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
+	if (durable_rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD, DEBUG1) == 0)
 	{
 		ereport(LOG,
 				(errmsg("online backup mode canceled"),
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 277c14a..bcfc53f 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -451,13 +451,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 		 */
 		snprintf(oldpath, MAXPGPATH, "%s.deleted%u",
 				 xlogfpath, deletedcounter++);
-		if (rename(xlogfpath, oldpath) != 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rename file \"%s\" to \"%s\": %m",
-							xlogfpath, oldpath)));
-		}
+		durable_rename(xlogfpath, oldpath, ERROR);
 #else
 		/* same-size buffers, so this never truncates */
 		strlcpy(oldpath, xlogfpath, MAXPGPATH);
@@ -470,11 +464,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 		reload = true;
 	}
 
-	if (rename(path, xlogfpath) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						path, xlogfpath)));
+	durable_rename(path, xlogfpath, ERROR);
 
 	/*
 	 * Create .done file forcibly to prevent the restored segment from being
@@ -580,12 +570,7 @@ XLogArchiveForceDone(const char *xlog)
 	StatusFilePath(archiveReady, xlog, ".ready");
 	if (stat(archiveReady, &stat_buf) == 0)
 	{
-		if (rename(archiveReady, archiveDone) < 0)
-			ereport(WARNING,
-					(errcode_for_file_access(),
-					 errmsg("could not rename file \"%s\" to \"%s\": %m",
-							archiveReady, archiveDone)));
-
+		(void) durable_rename(archiveReady, archiveDone, WARNING);
 		return;
 	}
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 397f802..1aa6466 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -728,9 +728,5 @@ pgarch_archiveDone(char *xlog)
 
 	StatusFilePath(rlogready, xlog, ".ready");
 	StatusFilePath(rlogdone, xlog, ".done");
-	if (rename(rlogready, rlogdone) < 0)
-		ereport(WARNING,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						rlogready, rlogdone)));
+	(void) durable_rename(rlogready, rlogdone, WARNING);
 }
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 0caf7a3..8c8833b 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -604,29 +604,10 @@ CheckPointReplicationOrigin(void)
 						tmppath)));
 	}
 
-	/* fsync the temporary file */
-	if (pg_fsync(tmpfd) != 0)
-	{
-		CloseTransientFile(tmpfd);
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not fsync file \"%s\": %m",
-						tmppath)));
-	}
-
 	CloseTransientFile(tmpfd);
 
-	/* rename to permanent file, fsync file and directory */
-	if (rename(tmppath, path) != 0)
-	{
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						tmppath, path)));
-	}
-
-	fsync_fname((char *) path, false);
-	fsync_fname("pg_logical", true);
+	/* fsync, rename to permanent file, fsync file and directory */
+	durable_rename(tmppath, path, PANIC);
 }
 
 /*
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index affa9b9..ead221d 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1095,7 +1095,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	START_CRIT_SECTION();
 
 	fsync_fname(path, false);
-	fsync_fname((char *) dir, true);
+	fsync_fname(dir, true);
 	fsync_fname("pg_replslot", true);
 
 	END_CRIT_SECTION();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1b30100..e3ccb8c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -306,7 +306,10 @@ static void walkdir(const char *path,
 #ifdef PG_FLUSH_DATA_WORKS
 static void pre_sync_fname(const char *fname, bool isdir, int elevel);
 #endif
-static void fsync_fname_ext(const char *fname, bool isdir, int elevel);
+static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
+
+static int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+static int fsync_parent_path(const char *fname, int elevel);
 
 
 /*
@@ -413,54 +416,142 @@ pg_flush_data(int fd, off_t offset, off_t amount)
  * indicate the OS just doesn't allow/require fsyncing directories.
  */
 void
-fsync_fname(char *fname, bool isdir)
+fsync_fname(const char *fname, bool isdir)
 {
-	int			fd;
-	int			returncode;
-
-	/*
-	 * Some OSs require directories to be opened read-only whereas other
-	 * systems don't allow us to fsync files opened read-only; so we need both
-	 * cases here
-	 */
-	if (!isdir)
-		fd = OpenTransientFile(fname,
-							   O_RDWR | PG_BINARY,
-							   S_IRUSR | S_IWUSR);
-	else
-		fd = OpenTransientFile(fname,
-							   O_RDONLY | PG_BINARY,
-							   S_IRUSR | S_IWUSR);
-
-	/*
-	 * Some OSs don't allow us to open directories at all (Windows returns
-	 * EACCES)
-	 */
-	if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES))
-		return;
-
-	else if (fd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m", fname)));
-
-	returncode = pg_fsync(fd);
-
-	/* Some OSs don't allow us to fsync directories at all */
-	if (returncode != 0 && isdir && errno == EBADF)
-	{
-		CloseTransientFile(fd);
-		return;
-	}
-
-	if (returncode != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not fsync file \"%s\": %m", fname)));
-
-	CloseTransientFile(fd);
+	fsync_fname_ext(fname, isdir, false, ERROR);
 }
 
+/*
+ * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability
+ *
+ * This routine ensures that, after returning, the effect of renaming file
+ * persists in case of a crash. A crash while this routine is running will
+ * leave you with either the old, or the new file.
+ *
+ * It does so by using fsync on the sourcefile before the rename, and the
+ * target file and directory after.
+ *
+ * Note that rename() cannot be used across arbitrary directories, as they
+ * might not be on the same filesystem. Therefore this routine does not
+ * support renaming across directories.
+ */
+int
+durable_rename(const char *oldfile, const char *newfile, int elevel)
+{
+	int		fd;
+
+	/*
+	 * First fsync the old and target path (if it exists), to ensure that they
+	 * are properly persistent on disk. Syncing the target file is not
+	 * strictly necessary, but it makes it easier to reason about crashes;
+	 * because it's then guaranteed that either source or target file exists
+	 * after a crash.
+	 */
+	if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
+		return -1;
+
+	fd = OpenTransientFile((char *) newfile, PG_BINARY | O_RDWR, 0);
+	if (fd < 0)
+	{
+		if (errno != ENOENT)
+		{
+			ereport(elevel,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", newfile)));
+			return -1;
+		}
+	}
+
+	if (pg_fsync(fd) != 0)
+	{
+		/* XXX: perform close() before? might be outside a transaction. Consider errno! */
+		ereport(elevel,
+				(errcode_for_file_access(),
+				 errmsg("could not fsync file \"%s\": %m",
+						newfile)));
+		CloseTransientFile(fd);
+		return -1;
+	}
+	CloseTransientFile(fd);
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) < 0)
+	{
+		ereport(elevel,
+				(errcode_for_file_access(),
+				 errmsg("could not rename file \"%s\" to \"%s\": %m",
+						oldfile, newfile)));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	if (fsync_fname_ext(newfile, false, false, elevel) != 0)
+		return -1;
+
+	/* Same for parent directory */
+	if (fsync_parent_path(newfile, elevel) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_link_or_rename -- rename a file in a durable manner.
+ *
+ * Similar to durable_rename(), except that this routine tries (but does not
+ * guarantee) not to overwrite the target file.
+ *
+ * Note that a crash in an unfortunate moment can leave you with two links to
+ * the target file.
+ */
+int
+durable_link_or_rename(const char *oldfile, const char *newfile,  int elevel)
+{
+	/*
+	 * Ensure that, if we crash directly after the rename/link, a file with
+	 * valid contents is moved into place.
+	 */
+	if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
+		return -1;
+
+#if HAVE_WORKING_LINK
+	if (link(oldfile, newfile) < 0)
+	{
+		ereport(elevel,
+				(errcode_for_file_access(),
+				 errmsg("could not link file \"%s\" to \"%s\": %m",
+						oldfile, newfile)));
+		return -1;
+	}
+	unlink(oldfile);
+#else
+	/* XXX: Add racy file existence check? */
+	if (rename(oldfile, newfile) < 0)
+	{
+		ereport(elevel,
+				(errcode_for_file_access(),
+				 errmsg("could not rename file \"%s\" to \"%s\": %m",
+						oldfile, newfile)));
+		return -1;
+	}
+#endif
+
+	/*
+	 * Make change persistent in case of an OS crash, both the new entry and
+	 * its parent directory need to be flushed.
+	 */
+	if (fsync_fname_ext(newfile, false, false, elevel) != 0)
+		return -1;
+
+	/* Same for parent directory */
+	if (fsync_parent_path(newfile, elevel) != 0)
+		return -1;
+
+	return 0;
+}
 
 /*
  * InitFileAccess --- initialize this module during backend startup
@@ -2547,10 +2638,10 @@ SyncDataDirectory(void)
 	 * in pg_tblspc, they'll get fsync'd twice.  That's not an expected case
 	 * so we don't worry about optimizing it.
 	 */
-	walkdir(".", fsync_fname_ext, false, LOG);
+	walkdir(".", datadir_fsync_fname, false, LOG);
 	if (xlog_is_symlink)
-		walkdir("pg_xlog", fsync_fname_ext, false, LOG);
-	walkdir("pg_tblspc", fsync_fname_ext, true, LOG);
+		walkdir("pg_xlog", datadir_fsync_fname, false, LOG);
+	walkdir("pg_tblspc", datadir_fsync_fname, true, LOG);
 }
 
 /*
@@ -2667,12 +2758,12 @@ pre_sync_fname(const char *fname, bool isdir, int elevel)
 /*
  * fsync_fname_ext -- Try to fsync a file or directory
  *
- * Ignores errors trying to open unreadable files, or trying to fsync
- * directories on systems where that isn't allowed/required, and logs other
- * errors at a caller-specified level.
+ * If desired ignores errors trying to open unreadable files, or trying to
+ * fsync directories on systems where that isn't allowed/required, and logs
+ * other errors at a caller-specified level.
  */
-static void
-fsync_fname_ext(const char *fname, bool isdir, int elevel)
+static int
+fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
 {
 	int			fd;
 	int			flags;
@@ -2690,20 +2781,23 @@ fsync_fname_ext(const char *fname, bool isdir, int elevel)
 	else
 		flags |= O_RDONLY;
 
-	/*
-	 * Open the file, silently ignoring errors about unreadable files (or
-	 * unsupported operations, e.g. opening a directory under Windows), and
-	 * logging others.
-	 */
 	fd = OpenTransientFile((char *) fname, flags, 0);
-	if (fd < 0)
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES), just ignore the error in that case.  If desired, also silently
+	 * ignore errors about unreadable files.  Log others.
+	 */
+	if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES))
+		return 0;
+	else if (fd < 0 && ignore_perm && errno == EACCES)
+		return 0;
+	else if (fd < 0)
 	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", fname)));
-		return;
+		return -1;
 	}
 
 	returncode = pg_fsync(fd);
@@ -2713,9 +2807,56 @@ fsync_fname_ext(const char *fname, bool isdir, int elevel)
 	 * those errors. Anything else needs to be logged.
 	 */
 	if (returncode != 0 && !(isdir && errno == EBADF))
+	{
+		/* XXX: perform close() before? might be outside a transaction. Consider errno! */
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", fname)));
+		(void) CloseTransientFile(fd);
+		return -1;
+	}
 
 	(void) CloseTransientFile(fd);
+
+	return 0;
+}
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+static int
+fsync_parent_path(const char *fname, int elevel)
+{
+	char	parentpath[MAXPGPATH];
+
+	/* Same for parent directory */
+	snprintf(parentpath, MAXPGPATH, "%s", fname);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument
+	 * is just a file name (see comments in path.c), so handle that as being
+	 * the current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		sprintf(parentpath, ".");
+
+	if (fsync_fname_ext(parentpath, true, false, elevel) != 0)
+		return -1;
+
+	return 0;
+}
+
+static void
+datadir_fsync_fname(const char *fname, bool isdir, int elevel)
+{
+	/*
+	 * We want to silently ignore errors about unreadable files (or
+	 * unsupported operations, e.g. opening a directory under Windows). Pass
+	 * that desire on to fsync_fname_ext().
+	 */
+	fsync_fname_ext(fname, isdir, true, elevel);
 }
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 4a3fccb..66dc5dc 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -113,7 +113,9 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern int	pg_flush_data(int fd, off_t offset, off_t amount);
-extern void fsync_fname(char *fname, bool isdir);
+extern void fsync_fname(const char *fname, bool isdir);
+extern int	durable_rename(const char *oldfile, const char *newfile, int loglevel);
+extern int	durable_link_or_rename(const char *oldfile, const char *newfile, int loglevel);
 extern void SyncDataDirectory(void);
 
 /* Filename components for OpenTemporaryFile */
-- 
2.7.0.229.g701fa7f

#65Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#64)
Re: silent data loss with ext4 / all current versions

On Mon, Mar 7, 2016 at 3:38 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-05 19:54:05 -0800, Andres Freund wrote:

I started working on this; delayed by taking longer than planned on the
logical decoding stuff (quite a bit complicated by
e1a11d93111ff3fba7a91f3f2ac0b0aca16909a8). I'm not very happy with the
error handling as it is right now. For one, you have rename_safe return
error codes, and act on them in the callers, but on the other hand you
call fsync_fname which always errors out in case of failure. I also
don't like the new messages much.

Will continue working on this tomorrow.

So, here's my current version of this. I've not performed any testing
yet, and it's hot off the press. There's some comment smithing
needed. But otherwise I'm starting to like this.

Changes:
* renamed rename_safe to durable_rename
* renamed replace_safe to durable_link_or_rename (there was never any
replacing going on)
* pass through elevel to the underlying routines, otherwise we could
error out with ERROR when we don't want to. That's particularly
important in case of things like InstallXLogFileSegment().
* made fsync_fname use fsync_fname_ext, add 'accept permission errors'
param
* have walkdir call a wrapper, to add ignore_perms param

What do you think?

I have spent a couple of hours looking at that in detail, and the
patch is neat.

+ * This routine ensures that, after returning, the effect of renaming file
+ * persists in case of a crash. A crash while this routine is running will
+ * leave you with either the old, or the new file.

"you" is not really Postgres-like, "the server" or "the backend" perhaps?

+       /* XXX: perform close() before? might be outside a transaction. Consider errno! */
        ereport(elevel,
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m", fname)));
+       (void) CloseTransientFile(fd);
+       return -1;

close() should be called before. slot.c for example does so and we
don't want to leak an fd here in case of elevel >= ERROR.

+ * It does so by using fsync on the sourcefile before the rename, and the
+ * target file and directory after.

fsync is issued as well on the target file if it exists. I think
that's worth mentioning in the header.

+   /* XXX: Add racy file existence check? */
+   if (rename(oldfile, newfile) < 0)

I am not sure we should worry about that; what do you think could
cause the old file to go missing all of a sudden? Other backend
processes are not playing with it in the code paths where this routine
is called. Perhaps adding a comment in the header to let users know
that would help?

Instead of "durable" I think that "persistent" makes more sense. We
want to make those renames persistent on disk in case of a crash. So I
would suggest the following routine names:
- rename_persistent
- rename_or_link_persistent
Having the verb first also helps identify that this is a
system-level, rename()-like, routine.

I sure wish we had the recovery testing et al. in all the back
branches...

Sure, what we have now is something that should really be backpatched,
I was just waiting to have all the existing stability issues
addressed, the last one on my agenda being the failure of hamster for
test 005 I mentioned in another thread before sending patches for
other branches. I proposed a couple of potential fixes regarding that
actually, see here:
/messages/by-id/CAB7nPqSAZ9HnUcMoUa30JO2wJ8MnREm18p2a7McRA-ZrJxj3Vw@mail.gmail.com
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#66Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#65)
Re: silent data loss with ext4 / all current versions

Hi,

On 2016-03-08 12:01:18 +0900, Michael Paquier wrote:

I have spent a couple of hours looking at that in detail, and the
patch is neat.

Cool. Doing some more polishing right now. Will be back with an updated
version soonish.

Did you do some testing?

+ * This routine ensures that, after returning, the effect of renaming file
+ * persists in case of a crash. A crash while this routine is running will
+ * leave you with either the old, or the new file.

"you" is not really Postgres-like, "the server" or "the backend" perhaps?

Hm. I think your alternative proposals are more awkward.

+       /* XXX: perform close() before? might be outside a
transaction. Consider errno! */
ereport(elevel,
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m", fname)));
+       (void) CloseTransientFile(fd);
+       return -1;
close() should be called before. slot.c for example does so and we
don't want to leak an fd here in case of elevel >= ERROR.

Note that the transient file machinery will normally prevent fd leakage
- but it can only do so if called in a transaction context. I've added
int save_errno;

/* close file upon error, might not be in transaction context */
save_errno = errno;
CloseTransientFile(fd);
errno = save_errno;
stanzas.
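
The errno-preserving close described above can be illustrated outside PostgreSQL with a minimal sketch. It assumes plain POSIX close() in place of CloseTransientFile(), and the helper name close_preserving_errno() is hypothetical, not something the patch defines.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): close fd without disturbing
 * the errno value left behind by an earlier failed call, so that a later
 * error report still describes the original failure.
 */
static void
close_preserving_errno(int fd)
{
	int			save_errno = errno;

	(void) close(fd);			/* may overwrite errno if it fails */
	errno = save_errno;			/* restore the original failure code */
}
```

A caller would invoke this between the failed fsync and the error report, which matches the ordering used in the stanzas quoted above.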

+ * It does so by using fsync on the sourcefile before the rename, and the
+ * target file and directory after.

fsync is issued as well on the target file if it exists. I think
that's worth mentioning in the header.

Ok.

+   /* XXX: Add racy file existence check? */
+   if (rename(oldfile, newfile) < 0)

I am not sure we should worry about that; what do you think could
cause the old file to go missing all of a sudden? Other backend
processes are not playing with it in the code paths where this routine
is called. Perhaps adding a comment in the header to let users know
that would help?

What I'm thinking of is adding a check whether the *target* file already
exists, and error out in that case. Just like the link() based path
normally does.
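
A rough sketch of such a target-existence check, assuming plain POSIX calls and no PostgreSQL error reporting. The function name rename_if_target_absent() is hypothetical, and the check is racy in exactly the sense of the XXX comment: another process could create the target between lstat() and rename().

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical sketch: refuse to rename onto an existing target, roughly
 * mimicking the EEXIST failure that link() gives.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int
rename_if_target_absent(const char *oldpath, const char *newpath)
{
	struct stat st;

	assert(oldpath != NULL && newpath != NULL);

	if (lstat(newpath, &st) == 0)
	{
		errno = EEXIST;			/* target already exists */
		return -1;
	}
	if (errno != ENOENT)
		return -1;				/* could not tell whether the target exists */

	/* Racy window: the target may appear before rename() runs. */
	return rename(oldpath, newpath);
}
```

Because of that window, this can only reduce the chance of overwriting a file; it cannot rule it out the way link() does.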

Instead of "durable" I think that "persistent" makes more sense.

I find durable a lot more descriptive. persistent could refer to
retrying the rename or something.

We
want to make those renames persistent on disk in case of a crash. So I
would suggest the following routine names:
- rename_persistent
- rename_or_link_persistent
Having the verb first also helps identify that this is a
system-level, rename()-like, routine.

I prefer the current names.

I sure wish we had the recovery testing et al. in all the back
branches...

Sure, what we have now is something that should really be backpatched,
I was just waiting to have all the existing stability issues
addressed, the last one on my agenda being the failure of hamster for
test 005 I mentioned in another thread before sending patches for
other branches. I proposed a couple of potential fixes regarding that
actually, see here:
/messages/by-id/CAB7nPqSAZ9HnUcMoUa30JO2wJ8MnREm18p2a7McRA-ZrJxj3Vw@mail.gmail.com

Yea. Will be an interesting discussion... Anyway, I did run the patch
through the existing checks, after enabling fsync in PostgresNode.pm.

Greetings,

Andres Freund


#67Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#66)
Re: silent data loss with ext4 / all current versions

On Tue, Mar 8, 2016 at 12:18 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-08 12:01:18 +0900, Michael Paquier wrote:

I have spent a couple of hours looking at that in detail, and the
patch is neat.

Cool. Doing some more polishing right now. Will be back with an updated
version soonish.

Did you do some testing?

Not much in details yet, I just ran a check-world with fsync enabled
for the recovery tests, plus quick manual tests with a cluster
manually set up. I'll do more with your new version now that I know
there will be one.

+   /* XXX: Add racy file existence check? */
+   if (rename(oldfile, newfile) < 0)

I am not sure we should worry about that; what do you think could
cause the old file to go missing all of a sudden? Other backend
processes are not playing with it in the code paths where this routine
is called. Perhaps adding a comment in the header to let users know
that would help?

What I'm thinking of is adding a check whether the *target* file already
exists, and error out in that case. Just like the link() based path
normally does.

Ah, OK. Well, why not. I'd rather have an assertion instead of an error though.
--
Michael


#68Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#67)
2 attachment(s)
Re: silent data loss with ext4 / all current versions

On 2016-03-08 12:26:34 +0900, Michael Paquier wrote:

On Tue, Mar 8, 2016 at 12:18 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-08 12:01:18 +0900, Michael Paquier wrote:

I have spent a couple of hours looking at that in detail, and the
patch is neat.

Cool. Doing some more polishing right now. Will be back with an updated
version soonish.

Did you do some testing?

Not much in details yet, I just ran a check-world with fsync enabled
for the recovery tests, plus quick manual tests with a cluster
manually set up. I'll do more with your new version now that I know
there will be one.

Here's my updated version.

Note that I've split the patch into two. One for the infrastructure, and
one for the callsites.

+   /* XXX: Add racy file existence check? */
+   if (rename(oldfile, newfile) < 0)

I am not sure we should worry about that; what do you think could
cause the old file to go missing all of a sudden? Other backend
processes are not playing with it in the code paths where this routine
is called. Perhaps adding a comment in the header to let users know
that would help?

What I'm thinking of is adding a check whether the *target* file already
exists, and error out in that case. Just like the link() based path
normally does.

Ah, OK. Well, why not. I'd rather have an assertion instead of an error though.

I think it should definitely be an error if anything. But I'd rather
only add it in master...

Andres

Attachments:

0001-Introduce-durable_rename-and-durable_link_or_rename.patch (text/x-patch; charset=us-ascii)
From 9dc71e059cc50d57e7f4f42c68b1c4afa07279a3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 7 Mar 2016 15:04:17 -0800
Subject: [PATCH 1/2] Introduce durable_rename() and durable_link_or_rename().

Renaming a file using rename(2) is not guaranteed to be durable in the
face of crashes. To be certain that a rename() atomically replaces the
previous file contents in the face of crashes and different filesystems,
one has to fsync the old filename, rename the file, fsync the new
filename, and fsync the containing directory.  This sequence is not
correctly adhered to currently, which exposes us to data loss risks. To
avoid having to repeat this arduous sequence, introduce
durable_rename(), which wraps all that.

Also add durable_link_or_rename(). Several places use link() (with a
fallback to rename()) to rename a file, trying to avoid replacing the
target file out of paranoia. Some of those rename sequences need to be
durable as well.

This commit does not yet make use of the new functions; they're used in
a followup commit.

Author: Michael Paquier, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
---
 src/backend/replication/slot.c |   2 +-
 src/backend/storage/file/fd.c  | 287 ++++++++++++++++++++++++++++++++---------
 src/include/storage/fd.h       |   4 +-
 3 files changed, 228 insertions(+), 65 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index affa9b9..ead221d 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1095,7 +1095,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 	START_CRIT_SECTION();
 
 	fsync_fname(path, false);
-	fsync_fname((char *) dir, true);
+	fsync_fname(dir, true);
 	fsync_fname("pg_replslot", true);
 
 	END_CRIT_SECTION();
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 1b30100..c9f9b7d 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -306,7 +306,10 @@ static void walkdir(const char *path,
 #ifdef PG_FLUSH_DATA_WORKS
 static void pre_sync_fname(const char *fname, bool isdir, int elevel);
 #endif
-static void fsync_fname_ext(const char *fname, bool isdir, int elevel);
+static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
+
+static int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+static int	fsync_parent_path(const char *fname, int elevel);
 
 
 /*
@@ -413,54 +416,158 @@ pg_flush_data(int fd, off_t offset, off_t amount)
  * indicate the OS just doesn't allow/require fsyncing directories.
  */
 void
-fsync_fname(char *fname, bool isdir)
+fsync_fname(const char *fname, bool isdir)
 {
-	int			fd;
-	int			returncode;
-
-	/*
-	 * Some OSs require directories to be opened read-only whereas other
-	 * systems don't allow us to fsync files opened read-only; so we need both
-	 * cases here
-	 */
-	if (!isdir)
-		fd = OpenTransientFile(fname,
-							   O_RDWR | PG_BINARY,
-							   S_IRUSR | S_IWUSR);
-	else
-		fd = OpenTransientFile(fname,
-							   O_RDONLY | PG_BINARY,
-							   S_IRUSR | S_IWUSR);
-
-	/*
-	 * Some OSs don't allow us to open directories at all (Windows returns
-	 * EACCES)
-	 */
-	if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES))
-		return;
-
-	else if (fd < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m", fname)));
-
-	returncode = pg_fsync(fd);
-
-	/* Some OSs don't allow us to fsync directories at all */
-	if (returncode != 0 && isdir && errno == EBADF)
-	{
-		CloseTransientFile(fd);
-		return;
-	}
-
-	if (returncode != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not fsync file \"%s\": %m", fname)));
-
-	CloseTransientFile(fd);
+	fsync_fname_ext(fname, isdir, false, ERROR);
 }
 
+/*
+ * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability
+ *
+ * This routine ensures that, after returning, the effect of renaming file
+ * persists in case of a crash. A crash while this routine is running will
+ * leave you with either the old, or the new file.
+ *
+ * It does so by using fsync on the sourcefile and the possibly existing
+ * targetfile before the rename, and the target file and directory after.
+ *
+ * Note that rename() cannot be used across arbitrary directories, as they
+ * might not be on the same filesystem. Therefore this routine does not
+ * support renaming across directories.
+ *
+ * Log errors with the caller specified severity.
+ *
+ * Returns 0 if the operation succeeded, -1 otherwise. Note that errno is not
+ * valid upon return.
+ */
+int
+durable_rename(const char *oldfile, const char *newfile, int elevel)
+{
+	int			fd;
+
+	/*
+	 * First fsync the old and target path (if it exists), to ensure that they
+	 * are properly persistent on disk. Syncing the target file is not
+	 * strictly necessary, but it makes it easier to reason about crashes,
+	 * because it's then guaranteed that either source or target file exists
+	 * after a crash.
+	 */
+	if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
+		return -1;
+
+	fd = OpenTransientFile((char *) newfile, PG_BINARY | O_RDWR, 0);
+	if (fd < 0)
+	{
+		if (errno != ENOENT)
+		{
+			ereport(elevel,
+					(errcode_for_file_access(),
+					 errmsg("could not open file \"%s\": %m", newfile)));
+			return -1;
+		}
+	}
+	else
+	{
+		if (pg_fsync(fd) != 0)
+		{
+			int			save_errno;
+
+			/* close file upon error, might not be in transaction context */
+			save_errno = errno;
+			CloseTransientFile(fd);
+			errno = save_errno;
+
+			ereport(elevel,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m", newfile)));
+			return -1;
+		}
+		CloseTransientFile(fd);
+	}
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) < 0)
+	{
+		ereport(elevel,
+				(errcode_for_file_access(),
+				 errmsg("could not rename file \"%s\" to \"%s\": %m",
+						oldfile, newfile)));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	if (fsync_fname_ext(newfile, false, false, elevel) != 0)
+		return -1;
+
+	/* Same for parent directory */
+	if (fsync_parent_path(newfile, elevel) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_link_or_rename -- rename a file in a durable manner.
+ *
+ * Similar to durable_rename(), except that this routine tries (but does not
+ * guarantee) not to overwrite the target file.
+ *
+ * Note that a crash in an unfortunate moment can leave you with two links to
+ * the target file.
+ *
+ * Log errors with the caller specified severity.
+ *
+ * Returns 0 if the operation succeeded, -1 otherwise. Note that errno is not
+ * valid upon return.
+ */
+int
+durable_link_or_rename(const char *oldfile, const char *newfile, int elevel)
+{
+	/*
+	 * Ensure that, if we crash directly after the rename/link, a file with
+	 * valid contents is moved into place.
+	 */
+	if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
+		return -1;
+
+#if HAVE_WORKING_LINK
+	if (link(oldfile, newfile) < 0)
+	{
+		ereport(elevel,
+				(errcode_for_file_access(),
+				 errmsg("could not link file \"%s\" to \"%s\": %m",
+						oldfile, newfile)));
+		return -1;
+	}
+	unlink(oldfile);
+#else
+	/* XXX: Add racy file existence check? */
+	if (rename(oldfile, newfile) < 0)
+	{
+		ereport(elevel,
+				(errcode_for_file_access(),
+				 errmsg("could not rename file \"%s\" to \"%s\": %m",
+						oldfile, newfile)));
+		return -1;
+	}
+#endif
+
+	/*
+	 * Make change persistent in case of an OS crash, both the new entry and
+	 * its parent directory need to be flushed.
+	 */
+	if (fsync_fname_ext(newfile, false, false, elevel) != 0)
+		return -1;
+
+	/* Same for parent directory */
+	if (fsync_parent_path(newfile, elevel) != 0)
+		return -1;
+
+	return 0;
+}
 
 /*
  * InitFileAccess --- initialize this module during backend startup
@@ -2547,10 +2654,10 @@ SyncDataDirectory(void)
 	 * in pg_tblspc, they'll get fsync'd twice.  That's not an expected case
 	 * so we don't worry about optimizing it.
 	 */
-	walkdir(".", fsync_fname_ext, false, LOG);
+	walkdir(".", datadir_fsync_fname, false, LOG);
 	if (xlog_is_symlink)
-		walkdir("pg_xlog", fsync_fname_ext, false, LOG);
-	walkdir("pg_tblspc", fsync_fname_ext, true, LOG);
+		walkdir("pg_xlog", datadir_fsync_fname, false, LOG);
+	walkdir("pg_tblspc", datadir_fsync_fname, true, LOG);
 }
 
 /*
@@ -2664,15 +2771,26 @@ pre_sync_fname(const char *fname, bool isdir, int elevel)
 
 #endif   /* PG_FLUSH_DATA_WORKS */
 
+static void
+datadir_fsync_fname(const char *fname, bool isdir, int elevel)
+{
+	/*
+	 * We want to silently ignore errors about unreadable files.  Pass that
+	 * desire on to fsync_fname_ext().
+	 */
+	fsync_fname_ext(fname, isdir, true, elevel);
+}
+
 /*
  * fsync_fname_ext -- Try to fsync a file or directory
  *
- * Ignores errors trying to open unreadable files, or trying to fsync
- * directories on systems where that isn't allowed/required, and logs other
- * errors at a caller-specified level.
+ * If ignore_perm is true, ignore errors upon trying to open unreadable
+ * files. Logs other errors at a caller-specified level.
+ *
+ * Returns 0 if the operation succeeded, -1 otherwise.
  */
-static void
-fsync_fname_ext(const char *fname, bool isdir, int elevel)
+static int
+fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
 {
 	int			fd;
 	int			flags;
@@ -2690,20 +2808,23 @@ fsync_fname_ext(const char *fname, bool isdir, int elevel)
 	else
 		flags |= O_RDONLY;
 
-	/*
-	 * Open the file, silently ignoring errors about unreadable files (or
-	 * unsupported operations, e.g. opening a directory under Windows), and
-	 * logging others.
-	 */
 	fd = OpenTransientFile((char *) fname, flags, 0);
-	if (fd < 0)
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES), just ignore the error in that case.  If desired, also silently
+	 * ignore errors about unreadable files.  Log others.
+	 */
+	if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES))
+		return 0;
+	else if (fd < 0 && ignore_perm && errno == EACCES)
+		return 0;
+	else if (fd < 0)
 	{
-		if (errno == EACCES || (isdir && errno == EISDIR))
-			return;
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", fname)));
-		return;
+		return -1;
 	}
 
 	returncode = pg_fsync(fd);
@@ -2713,9 +2834,49 @@ fsync_fname_ext(const char *fname, bool isdir, int elevel)
 	 * those errors. Anything else needs to be logged.
 	 */
 	if (returncode != 0 && !(isdir && errno == EBADF))
+	{
+		int			save_errno;
+
+		/* close file upon error, might not be in transaction context */
+		save_errno = errno;
+		(void) CloseTransientFile(fd);
+		errno = save_errno;
+
 		ereport(elevel,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", fname)));
+		return -1;
+	}
 
 	(void) CloseTransientFile(fd);
+
+	return 0;
+}
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+static int
+fsync_parent_path(const char *fname, int elevel)
+{
+	char		parentpath[MAXPGPATH];
+
+	strlcpy(parentpath, fname, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	if (fsync_fname_ext(parentpath, true, false, elevel) != 0)
+		return -1;
+
+	return 0;
 }
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 4a3fccb..66dc5dc 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -113,7 +113,9 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern int	pg_flush_data(int fd, off_t offset, off_t amount);
-extern void fsync_fname(char *fname, bool isdir);
+extern void fsync_fname(const char *fname, bool isdir);
+extern int	durable_rename(const char *oldfile, const char *newfile, int loglevel);
+extern int	durable_link_or_rename(const char *oldfile, const char *newfile, int loglevel);
 extern void SyncDataDirectory(void);
 
 /* Filename components for OpenTemporaryFile */
-- 
2.7.0.229.g701fa7f

0002-Avoid-unlikely-data-loss-scenarios-due-to-rename-wit.patch (text/x-patch; charset=us-ascii)
From 34861d06b02090c176b45e9be2ea39969fc8a9f8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 7 Mar 2016 18:00:56 -0800
Subject: [PATCH 2/2] Avoid unlikely data-loss scenarios due to rename()
 without fsync.

Renaming a file using rename(2) is not guaranteed to be durable in the
face of crashes. Use the previously added
durable_rename()/durable_link_or_rename() in various places where we
previously just renamed files.

Most of the changed callsites are arguably not critical, but
e.g. "losing" a recycled WAL file due to a crash can corrupt the
entire cluster.

Reported-By: Tomas Vondra
Author: Michael Paquier, Tomas Vondra, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
---
 contrib/pg_stat_statements/pg_stat_statements.c |  6 +--
 src/backend/access/transam/timeline.c           | 40 +++-------------
 src/backend/access/transam/xlog.c               | 64 +++++--------------------
 src/backend/access/transam/xlogarchive.c        | 21 ++------
 src/backend/postmaster/pgarch.c                 |  6 +--
 src/backend/replication/logical/origin.c        | 23 +--------
 src/backend/utils/misc/guc.c                    |  6 +--
 7 files changed, 26 insertions(+), 140 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index dffc477..9ce60e6 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -741,11 +741,7 @@ pgss_shmem_shutdown(int code, Datum arg)
 	/*
 	 * Rename file into place, so we atomically replace any old one.
 	 */
-	if (rename(PGSS_DUMP_FILE ".tmp", PGSS_DUMP_FILE) != 0)
-		ereport(LOG,
-				(errcode_for_file_access(),
-				 errmsg("could not rename pg_stat_statement file \"%s\": %m",
-						PGSS_DUMP_FILE ".tmp")));
+	(void) durable_rename(PGSS_DUMP_FILE ".tmp", PGSS_DUMP_FILE, LOG);
 
 	/* Unlink query-texts file; it's not needed while shutdown */
 	unlink(PGSS_TEXT_FILE);
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index f6da673..bd91573 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -418,24 +418,10 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	TLHistoryFilePath(path, newTLI);
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing file.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Perform the rename using link if available, paranoidly trying to avoid
+	 * overwriting an existing file (there shouldn't be one).
 	 */
-#if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\": %m",
-						tmppath, path)));
-	unlink(tmppath);
-#else
-	if (rename(tmppath, path) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						tmppath, path)));
-#endif
+	durable_link_or_rename(tmppath, path, ERROR);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -508,24 +494,10 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	TLHistoryFilePath(path, tli);
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Perform the rename using link if available, paranoidly trying to avoid
+	 * overwriting an existing file (there shouldn't be one).
 	 */
-#if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\": %m",
-						tmppath, path)));
-	unlink(tmppath);
-#else
-	if (rename(tmppath, path) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						tmppath, path)));
-#endif
+	durable_link_or_rename(tmppath, path, ERROR);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 00f139a..2d63a54 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3299,34 +3299,16 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 
 	/*
-	 * Prefer link() to rename() here just to be really sure that we don't
-	 * overwrite an existing logfile.  However, there shouldn't be one, so
-	 * rename() is an acceptable substitute except for the truly paranoid.
+	 * Perform the rename using link if available, paranoidly trying to avoid
+	 * overwriting an existing file (there shouldn't be one).
 	 */
-#if HAVE_WORKING_LINK
-	if (link(tmppath, path) < 0)
+	if (durable_link_or_rename(tmppath, path, LOG) != 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
-		ereport(LOG,
-				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\" (initialization of log file): %m",
-						tmppath, path)));
+		/* durable_link_or_rename already emitted log message */
 		return false;
 	}
-	unlink(tmppath);
-#else
-	if (rename(tmppath, path) < 0)
-	{
-		if (use_lock)
-			LWLockRelease(ControlFileLock);
-		ereport(LOG,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\" (initialization of log file): %m",
-						tmppath, path)));
-		return false;
-	}
-#endif
 
 	if (use_lock)
 		LWLockRelease(ControlFileLock);
@@ -3840,14 +3822,8 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 		 * flag, rename will fail. We'll try again at the next checkpoint.
 		 */
 		snprintf(newpath, MAXPGPATH, "%s.deleted", path);
-		if (rename(path, newpath) != 0)
-		{
-			ereport(LOG,
-					(errcode_for_file_access(),
-			   errmsg("could not rename old transaction log file \"%s\": %m",
-					  path)));
+		if (durable_rename(path, newpath, LOG) != 0)
 			return;
-		}
 		rc = unlink(newpath);
 #else
 		rc = unlink(path);
@@ -5339,11 +5315,7 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * re-enter archive recovery mode in a subsequent crash.
 	 */
 	unlink(RECOVERY_COMMAND_DONE);
-	if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
-		ereport(FATAL,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
+	durable_rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE, FATAL);
 
 	ereport(LOG,
 			(errmsg("archive recovery complete")));
@@ -6190,7 +6162,7 @@ StartupXLOG(void)
 		if (stat(TABLESPACE_MAP, &st) == 0)
 		{
 			unlink(TABLESPACE_MAP_OLD);
-			if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
+			if (durable_rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD, DEBUG1) == 0)
 				ereport(LOG,
 					(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
 							TABLESPACE_MAP, BACKUP_LABEL_FILE),
@@ -6553,11 +6525,7 @@ StartupXLOG(void)
 		if (haveBackupLabel)
 		{
 			unlink(BACKUP_LABEL_OLD);
-			if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
-				ereport(FATAL,
-						(errcode_for_file_access(),
-						 errmsg("could not rename file \"%s\" to \"%s\": %m",
-								BACKUP_LABEL_FILE, BACKUP_LABEL_OLD)));
+			durable_rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD, FATAL);
 		}
 
 		/*
@@ -6570,11 +6538,7 @@ StartupXLOG(void)
 		if (haveTblspcMap)
 		{
 			unlink(TABLESPACE_MAP_OLD);
-			if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) != 0)
-				ereport(FATAL,
-						(errcode_for_file_access(),
-						 errmsg("could not rename file \"%s\" to \"%s\": %m",
-								TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
+			durable_rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD, FATAL);
 		}
 
 		/* Check that the GUCs used to generate the WAL allow recovery */
@@ -7351,11 +7315,7 @@ StartupXLOG(void)
 				 */
 				XLogArchiveCleanup(partialfname);
 
-				if (rename(origpath, partialpath) != 0)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-						 errmsg("could not rename file \"%s\" to \"%s\": %m",
-								origpath, partialpath)));
+				durable_rename(origpath, partialpath, ERROR);
 				XLogArchiveNotify(partialfname);
 			}
 		}
@@ -10911,7 +10871,7 @@ CancelBackup(void)
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(BACKUP_LABEL_OLD);
 
-	if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
+	if (durable_rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD, DEBUG1) != 0)
 	{
 		ereport(WARNING,
 				(errcode_for_file_access(),
@@ -10934,7 +10894,7 @@ CancelBackup(void)
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(TABLESPACE_MAP_OLD);
 
-	if (rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD) == 0)
+	if (durable_rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD, DEBUG1) == 0)
 	{
 		ereport(LOG,
 				(errmsg("online backup mode canceled"),
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 277c14a..bcfc53f 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -451,13 +451,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 		 */
 		snprintf(oldpath, MAXPGPATH, "%s.deleted%u",
 				 xlogfpath, deletedcounter++);
-		if (rename(xlogfpath, oldpath) != 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rename file \"%s\" to \"%s\": %m",
-							xlogfpath, oldpath)));
-		}
+		durable_rename(xlogfpath, oldpath, ERROR);
 #else
 		/* same-size buffers, so this never truncates */
 		strlcpy(oldpath, xlogfpath, MAXPGPATH);
@@ -470,11 +464,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 		reload = true;
 	}
 
-	if (rename(path, xlogfpath) < 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						path, xlogfpath)));
+	durable_rename(path, xlogfpath, ERROR);
 
 	/*
 	 * Create .done file forcibly to prevent the restored segment from being
@@ -580,12 +570,7 @@ XLogArchiveForceDone(const char *xlog)
 	StatusFilePath(archiveReady, xlog, ".ready");
 	if (stat(archiveReady, &stat_buf) == 0)
 	{
-		if (rename(archiveReady, archiveDone) < 0)
-			ereport(WARNING,
-					(errcode_for_file_access(),
-					 errmsg("could not rename file \"%s\" to \"%s\": %m",
-							archiveReady, archiveDone)));
-
+		(void) durable_rename(archiveReady, archiveDone, WARNING);
 		return;
 	}
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 397f802..1aa6466 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -728,9 +728,5 @@ pgarch_archiveDone(char *xlog)
 
 	StatusFilePath(rlogready, xlog, ".ready");
 	StatusFilePath(rlogdone, xlog, ".done");
-	if (rename(rlogready, rlogdone) < 0)
-		ereport(WARNING,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						rlogready, rlogdone)));
+	(void) durable_rename(rlogready, rlogdone, WARNING);
 }
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 0caf7a3..8c8833b 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -604,29 +604,10 @@ CheckPointReplicationOrigin(void)
 						tmppath)));
 	}
 
-	/* fsync the temporary file */
-	if (pg_fsync(tmpfd) != 0)
-	{
-		CloseTransientFile(tmpfd);
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not fsync file \"%s\": %m",
-						tmppath)));
-	}
-
 	CloseTransientFile(tmpfd);
 
-	/* rename to permanent file, fsync file and directory */
-	if (rename(tmppath, path) != 0)
-	{
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\": %m",
-						tmppath, path)));
-	}
-
-	fsync_fname((char *) path, false);
-	fsync_fname("pg_logical", true);
+	/* fsync, rename to permanent file, fsync file and directory */
+	durable_rename(tmppath, path, PANIC);
 }
 
 /*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea5a09a..0be64a1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -7037,11 +7037,7 @@ AlterSystemSetConfigFile(AlterSystemStmt *altersysstmt)
 		 * at worst it can lose the parameters set by last ALTER SYSTEM
 		 * command.
 		 */
-		if (rename(AutoConfTmpFileName, AutoConfFileName) < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rename file \"%s\" to \"%s\": %m",
-							AutoConfTmpFileName, AutoConfFileName)));
+		durable_rename(AutoConfTmpFileName, AutoConfFileName, ERROR);
 	}
 	PG_CATCH();
 	{
-- 
2.7.0.229.g701fa7f

#69Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#68)
Re: silent data loss with ext4 / all current versions

On Tue, Mar 8, 2016 at 2:55 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-08 12:26:34 +0900, Michael Paquier wrote:

On Tue, Mar 8, 2016 at 12:18 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-08 12:01:18 +0900, Michael Paquier wrote:

I have spent a couple of hours looking at that in detail, and the
patch is neat.

Cool. Doing some more polishing right now. Will be back with an updated
version soonish.

Did you do some testing?

Not in much detail yet, I just ran a check-world with fsync enabled
for the recovery tests, plus quick manual tests with a cluster
manually set up. I'll do more with your new version now that I know
there will be one.

Here's my updated version.

Note that I've split the patch into two. One for the infrastructure, and
one for the callsites.

Thanks for the updated patches and the split, this makes things easier
to look at. I have been doing some testing as well, mainly manually
with pgbench, and nothing looks broken.

+   durable_link_or_rename(tmppath, path, ERROR);
+   durable_rename(path, xlogfpath, ERROR);
You may want to add a (void) cast in front of those calls for correctness.

- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not link file \"%s\" to \"%s\" (initialization of log file): %m",
- tmppath, path)));
We lose a portion of the error message here, but with the file name
that's easy to guess where that is happening. I am not complaining
(that's fine to me as-is), just mentioning for the archive's sake.

+   /* XXX: Add racy file existence check? */
+   if (rename(oldfile, newfile) < 0)

I am not sure we should worry about that. What do you think could
cause the old file to go missing all of a sudden? Other backend
processes are not playing with it in the code paths where this routine
is called. Perhaps adding a comment in the header to let users know
that would help?

What I'm thinking of is adding a check whether the *target* file already
exists, and error out in that case. Just like the link() based path
normally does.

Ah, OK. Well, why not. I'd rather have an assertion instead of an error though.

I think it should definitely be an error if anything. But I'd rather
only add it in master...

I guess I know why :) That's also why I was thinking about an assertion.
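
For readers following the exchange above, the difference between the two paths can be sketched in plain POSIX C. This is only an illustration under stated assumptions: install_no_overwrite and its have_link flag are hypothetical names, not functions from the patch. It shows why link() needs no extra check (it fails with EEXIST when the target exists) while rename() replaces the target silently, so any existence check in front of it is inherently racy.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): move oldfile to newfile
 * without replacing an existing newfile.  Returns 0 on success, -1 on
 * failure with errno set.
 */
static int
install_no_overwrite(const char *oldfile, const char *newfile, int have_link)
{
	struct stat st;

	assert(oldfile != NULL && newfile != NULL);

	if (have_link)
	{
		/* link() itself fails with EEXIST if newfile is already present. */
		if (link(oldfile, newfile) != 0)
			return -1;
		(void) unlink(oldfile);
		return 0;
	}

	/*
	 * rename() silently replaces an existing target, so check first.  The
	 * check is racy: another process could create newfile between the
	 * stat() and the rename().
	 */
	if (stat(newfile, &st) == 0)
	{
		fprintf(stderr, "target \"%s\" already exists\n", newfile);
		errno = EEXIST;
		return -1;
	}
	else if (errno != ENOENT)
		return -1;

	return rename(oldfile, newfile);
}
```

Neither branch of this sketch fsyncs anything; it only illustrates the overwrite semantics being discussed, not durability.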
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#70Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#69)
Re: silent data loss with ext4 / all current versions

Hi,

On 2016-03-08 16:21:45 +0900, Michael Paquier wrote:

+   durable_link_or_rename(tmppath, path, ERROR);
+   durable_rename(path, xlogfpath, ERROR);

You may want to add a (void) cast in front of those calls for correctness.

"correctness"? This is neatnikism, not correctness. I've actually added
(void)'s to the sites that return on error (i.e. pass LOG or something),
but not the ones where we pass ERROR.
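
The convention described here can be illustrated with a small standalone sketch. Everything in it is an assumption made for illustration: sk_level, sk_report and sk_rename are hypothetical stand-ins, and exit() stands in for the non-local exit that ereport(ERROR) performs in the backend. The point is that the return value only carries information when the caller passes a level below ERROR.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical severity levels, standing in for the backend's elevel. */
typedef enum
{
	SK_LOG,
	SK_WARNING,
	SK_ERROR
} sk_level;

/* Report a failure; at SK_ERROR this does not return to the caller. */
static void
sk_report(sk_level level, const char *msg)
{
	const char *tag = (level == SK_ERROR) ? "ERROR" :
		(level == SK_WARNING) ? "WARNING" : "LOG";

	fprintf(stderr, "%s: %s\n", tag, msg);
	if (level == SK_ERROR)
		exit(1);				/* stand-in for a non-local exit */
}

static int
sk_rename(const char *oldfile, const char *newfile, sk_level level)
{
	assert(oldfile != NULL && newfile != NULL);

	if (rename(oldfile, newfile) != 0)
	{
		sk_report(level, strerror(errno));
		return -1;				/* only reached when level < SK_ERROR */
	}
	return 0;
}
```

A caller passing SK_LOG or SK_WARNING either checks the result or discards it explicitly with a (void) cast. A caller passing SK_ERROR never sees a failure return, which is why a cast there is purely cosmetic.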

- ereport(LOG,
- (errcode_for_file_access(),
- errmsg("could not link file \"%s\" to \"%s\" (initialization of log file): %m",
- tmppath, path)));
We lose a portion of the error message here, but with the file name
that's easy to guess where that is happening. I am not complaining
(that's fine to me as-is), just mentioning for the archive's sake.

Yea, I think that's fine too.

- Andres

#71Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#66)
Re: silent data loss with ext4 / all current versions

On Mon, Mar 7, 2016 at 10:18 PM, Andres Freund <andres@anarazel.de> wrote:

Instead of "durable" I think that "persistent" makes more sense.

I find durable a lot more descriptive. persistent could refer to
retrying the rename or something.

Yeah, I like durable, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#72Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#68)
Re: silent data loss with ext4 / all current versions

Hi,

On Mon, 2016-03-07 at 21:55 -0800, Andres Freund wrote:

On 2016-03-08 12:26:34 +0900, Michael Paquier wrote:

On Tue, Mar 8, 2016 at 12:18 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-08 12:01:18 +0900, Michael Paquier wrote:

I have spent a couple of hours looking at that in detail, and the
patch is neat.

Cool. Doing some more polishing right now. Will be back with an updated
version soonish.

Did you do some testing?

Not in much detail yet, I just ran a check-world with fsync enabled
for the recovery tests, plus quick manual tests with a cluster
manually set up. I'll do more with your new version now that I know
there will be one.

Here's my updated version.

Note that I've split the patch into two. One for the infrastructure, and
one for the callsites.

I've repeated the power-loss testing today. With the patches applied I'm
no longer able to reproduce the issue (despite trying about 10x), while
without them I've hit it on the first try. This is on kernel 4.4.2.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#73Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#72)
Re: silent data loss with ext4 / all current versions

On 2016-03-08 23:47:48 +0100, Tomas Vondra wrote:

I've repeated the power-loss testing today. With the patches applied I'm
no longer able to reproduce the issue (despite trying about 10x), while
without them I've hit it on the first try. This is on kernel 4.4.2.

Yay, thanks for testing!

Andres

#74Joshua D. Drake
jd@commandprompt.com
In reply to: Robert Haas (#71)
Re: silent data loss with ext4 / all current versions

On 03/08/2016 02:16 PM, Robert Haas wrote:

On Mon, Mar 7, 2016 at 10:18 PM, Andres Freund <andres@anarazel.de> wrote:

Instead of "durable" I think that "persistent" makes more sense.

I find durable a lot more descriptive. persistent could refer to
retrying the rename or something.

Yeah, I like durable, too.

There is also precedent, DURABLE as in aciD

JD

--
Command Prompt, Inc. http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.

#75Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#68)
Re: silent data loss with ext4 / all current versions

On 2016-03-07 21:55:52 -0800, Andres Freund wrote:

Here's my updated version.

Note that I've split the patch into two. One for the infrastructure, and
one for the callsites.

I've finally pushed these, after making a number of mostly cosmetic
fixes. The only one of real consequence is that I've removed the durable_*
call from the renames to .deleted in xlog[archive].c - these don't need
to be durable, and are Windows-only. Oh, and that there was a typo in
the !HAVE_WORKING_LINK case.

There's a *lot* of version skew here: not-present functionality, moved
files, different APIs - we got it all. I've tried to check in each
version whether we're missing fsyncs for renames and everything.
Michael, *please* double check the diffs for the different branches.

Note that we currently have some frontend programs with the equivalent
problem. Most importantly, receivelog.c (pg_basebackup/pg_receivexlog)
is missing pretty much the same directory fsyncs. And at least for
pg_receivexlog it's critical, especially now that pg_receivexlog supports
syncrep. I've not done anything about that; there's pretty much no
chance to share backend code with the frontend in the back-branches.
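
As an illustration of the missing piece, a frontend program can fsync a directory with nothing but POSIX calls. The sketch below is hypothetical (fsync_parent_dir is not an existing PostgreSQL function) and assumes a POSIX system where directories can be opened read-only; it does not handle trailing slashes in the path.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical frontend helper: fsync the directory containing "path" so
 * that a preceding rename() of that entry survives a crash.  Returns 0 on
 * success, -1 on failure.
 */
static int
fsync_parent_dir(const char *path)
{
	char		dir[1024];
	char	   *slash;
	int			fd;
	int			rc;

	assert(path != NULL);
	if (strlen(path) >= sizeof(dir))
		return -1;
	strcpy(dir, path);

	slash = strrchr(dir, '/');
	if (slash == NULL)
		strcpy(dir, ".");		/* bare file name: current directory */
	else if (slash == dir)
		dir[1] = '\0';			/* entry directly under "/" */
	else
		*slash = '\0';			/* cut off the last path component */

	fd = open(dir, O_RDONLY);
	if (fd < 0)
		return -1;
	rc = fsync(fd);
	close(fd);
	return rc;
}
```

A caller would invoke this right after a successful rename(oldfn, newfn), passing newfn, and treat a failure as seriously as a failed write.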

Greetings,

Andres Freund

#76Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#75)
3 attachment(s)
Re: silent data loss with ext4 / all current versions

On Thu, Mar 10, 2016 at 4:25 AM, Andres Freund wrote:

I've finally pushed these, after making a number of mostly cosmetic
fixes. The only one of real consequence is that I've removed the durable_*
call from the renames to .deleted in xlog[archive].c - these don't need
to be durable, and are Windows-only. Oh, and that there was a typo in
the !HAVE_WORKING_LINK case.

There's a *lot* of version skew here: not-present functionality, moved
files, different APIs - we got it all. I've tried to check in each
version whether we're missing fsyncs for renames and everything.
Michael, *please* double check the diffs for the different branches.

I have finally been able to spend some time reviewing what you pushed
on back-branches, and things are in correct shape I think. One small
issue that I have is that for EXEC_BACKEND builds, in
write_nondefault_variables we still use one instance of rename(). I
cannot really believe that there are production builds of Postgres
with EXEC_BACKEND on non-Windows platforms, but I think that we had
better cover our backs in this code path. As for the other 2 calls
of rename() in xlog.c and xlogarchive.c, those are fine left untouched; I
think there is no need to care about WIN32 blocks...

Note that we currently have some frontend programs with the equivalent
problem. Most importantly, receivelog.c (pg_basebackup/pg_receivexlog)
is missing pretty much the same directory fsyncs. And at least for
pg_receivexlog it's critical, especially now that pg_receivexlog supports
syncrep. I've not done anything about that; there's pretty much no
chance to share backend code with the frontend in the back-branches.

Yeah, true. We definitely need to do something for that. Even for HEAD
it seems like overkill to add something to, for example, src/common
so that frontends can share it if the fix is localized
(pg_rewind may use something else), and it would be nice to finish
wrapping that for the next minor release, so I propose the attached
patches. At the same time, I think that adminpack had better be fixed
as well, so there are actually three patches in this series, which
I shaped with a backpatch in mind btw, particularly for 0002.
--
Michael

Attachments:

0001-Make-rename-calls-for-log-files-in-adminpack-durable.patch (text/x-patch; charset=US-ASCII)
From 1a6a73565c36bac45b84ddf1b9718062c13d69cd Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Tue, 15 Mar 2016 14:50:48 +0100
Subject: [PATCH 1/3] Make rename() calls for log files in adminpack durable

As mentioned in 1d4a0ab1, rename() is not a durable operation in case
of crashes, causing renames to be potentially lost in such unfortunate
scenarios. The functions of adminpack are not critical code paths so they
do not induce any data loss, still it may be annoying for the upper
application layer like pgadmin to see inconsistent log files should a
server restart after a crash.
---
 contrib/adminpack/adminpack.c | 41 ++++++++++++-----------------------------
 1 file changed, 12 insertions(+), 29 deletions(-)

diff --git a/contrib/adminpack/adminpack.c b/contrib/adminpack/adminpack.c
index ea781a0..9136b79 100644
--- a/contrib/adminpack/adminpack.c
+++ b/contrib/adminpack/adminpack.c
@@ -209,44 +209,27 @@ pg_file_rename(PG_FUNCTION_ARGS)
 						fn3 ? fn3 : fn2)));
 	}
 
+	/*
+	 * Should a third file name be defined, use it as a temporary switch
+	 * that allows reverting back to the initial point should an error
+	 * occur.
+	 */
 	if (fn3)
 	{
-		if (rename(fn2, fn3) != 0)
-		{
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rename \"%s\" to \"%s\": %m",
-							fn2, fn3)));
-		}
-		if (rename(fn1, fn2) != 0)
-		{
-			ereport(WARNING,
-					(errcode_for_file_access(),
-					 errmsg("could not rename \"%s\" to \"%s\": %m",
-							fn1, fn2)));
+		/* durable_rename produces already a log entry */
+		durable_rename(fn2, fn3, ERROR);
 
-			if (rename(fn3, fn2) != 0)
-			{
-				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not rename \"%s\" back to \"%s\": %m",
-								fn3, fn2)));
-			}
-			else
-			{
+		if (durable_rename(fn1, fn2, WARNING) != 0)
+		{
+			if (durable_rename(fn3, fn2, ERROR) == 0)
 				ereport(ERROR,
 						(ERRCODE_UNDEFINED_FILE,
 						 errmsg("renaming \"%s\" to \"%s\" was reverted",
 								fn2, fn3)));
-			}
 		}
 	}
-	else if (rename(fn1, fn2) != 0)
-	{
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not rename \"%s\" to \"%s\": %m", fn1, fn2)));
-	}
+	else
+		durable_rename(fn1, fn2, ERROR);
 
 	PG_RETURN_BOOL(true);
 }
-- 
2.7.3

0002-Avoid-potential-data-loss-in-pg_receivexlog.patch (text/x-patch; charset=US-ASCII)
From 6d90b86685aa5cf3010d1edb3d91f8d0144c252f Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Tue, 15 Mar 2016 15:34:04 +0100
Subject: [PATCH 2/3] Avoid potential data loss in pg_receivexlog

pg_receivexlog makes use of rename() for timeline history files as well
as for completed WAL segments. However, this is not reliable and may cause
the rename operation to be lost in case of crashes. This commit makes use
of a similar function to backend's durable_rename to make the renaming operation
durable on disk.
---
 src/bin/pg_basebackup/receivelog.c | 93 +++++++++++++++++++++++++++++++++++---
 1 file changed, 87 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 595213f..b809582 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -50,6 +50,7 @@ static long CalculateCopyStreamSleeptime(int64 now, int standby_message_timeout,
 
 static bool ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos,
 						 uint32 *timeline);
+static int	durable_rename(const char *oldfile, const char *newfile);
 
 static bool
 mark_file_as_archived(const char *basedir, const char *fname)
@@ -217,10 +218,9 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", stream->basedir, current_walfile_name, stream->partial_suffix);
 		snprintf(newfn, sizeof(newfn), "%s/%s", stream->basedir, current_walfile_name);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, current_walfile_name, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -356,10 +356,9 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
-	if (rename(tmppath, path) < 0)
+	if (durable_rename(tmppath, path) < 0)
 	{
-		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
-				progname, tmppath, path, strerror(errno));
+		/* durable_rename produced a log entry */
 		return false;
 	}
 
@@ -786,6 +785,88 @@ ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos, uint32 *timeline)
 }
 
 /*
+ * Wrapper of rename() similar to the backend version with the same function
+ * name aimed at making the renaming durable on disk. Note that this version
+ * does not fsync the old file before the rename as all the code paths leading
+ * to this function are already doing this operation. The new file is also
+ * normally not present on disk before the renaming so there is no need to
+ * bother about it.
+ */
+static int
+durable_rename(const char *oldfile, const char *newfile)
+{
+	int		fd;
+	char	parentpath[MAXPGPATH];
+
+	if (rename(oldfile, newfile) != 0)
+	{
+		/* durable_rename produced a log entry */
+		fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
+				progname, current_walfile_name, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming of the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	fd = open(newfile, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
+	if (fd < 0)
+	{
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, newfile, strerror(errno));
+		return -1;
+	}
+
+	if (fsync(fd) != 0)
+	{
+		close(fd);
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, newfile, strerror(errno));
+		return -1;
+	}
+	close(fd);
+
+	strlcpy(parentpath, newfile, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	fd = open(parentpath, O_RDONLY | PG_BINARY, S_IRUSR | S_IWUSR);
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES), just ignore the error in that case.  If desired also silently
+	 * ignoring errors about unreadable files. Log others.
+	 */
+	if (fd < 0 && (errno == EISDIR || errno == EACCES))
+		return 0;
+	else if (fd < 0)
+	{
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, parentpath, strerror(errno));
+		return -1;
+	}
+
+	if (fsync(fd) != 0)
+	{
+		close(fd);
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, parentpath, strerror(errno));
+		return -1;
+	}
+	close(fd);
+
+	return 0;
+}
+
+/*
  * The main loop of ReceiveXlogStream. Handles the COPY stream after
  * initiating streaming with the START_STREAMING command.
  *
-- 
2.7.3

0003-Avoid-potential-lost-rename-of-new-parameter-file-in.patch (text/x-patch; charset=US-ASCII)
From ec81809e7b373feacd91418e850c44757fbaa1fe Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Tue, 15 Mar 2016 15:37:13 +0100
Subject: [PATCH 3/3] Avoid potential lost rename() of new parameter file in
 EXEC_BACKEND builds

guc.c is making use of rename(), though there are risks to lose the renaming
in case of a crash. Hence make use of durable_rename to make things durable.
---
 src/backend/utils/misc/guc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index edcafce..79e52d8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -8656,7 +8656,7 @@ write_nondefault_variables(GucContext context)
 	 * Put new file in place.  This could delay on Win32, but we don't hold
 	 * any exclusive locks.
 	 */
-	rename(CONFIG_EXEC_PARAMS_NEW, CONFIG_EXEC_PARAMS);
+	(void) durable_rename(CONFIG_EXEC_PARAMS_NEW, CONFIG_EXEC_PARAMS, DEBUG1);
 }
 
 
-- 
2.7.3

#77Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#76)
Re: silent data loss with ext4 / all current versions

On 2016-03-15 15:39:50 +0100, Michael Paquier wrote:

I have finally been able to spend some time reviewing what you pushed
on back-branches, and things are in correct shape I think. One small
issue that I have is that for EXEC_BACKEND builds, in
write_nondefault_variables we still use one instance of rename().

Correctly so afaics, because write_nondefault_variables is by definition
non-durable. We write that stuff at every start / SIGHUP. Adding an
fsync there would be an unnecessary slowdown. I don't think it's good
policy adding fsync for stuff that definitely doesn't need it.
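
The trade-off can be made concrete with a standalone sketch. It is not the guc.c code: replace_file and its durable flag are hypothetical, chosen only to show that the rename() makes the replacement atomic for readers in both modes, while the fsync() is an extra cost that is only worth paying for files that must survive a crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace "path" with "contents" via a temporary
 * file.  Readers see either the old or the new file in both modes.  Only
 * when "durable" is true is the new data flushed before the rename; a
 * fully durable variant would also fsync the containing directory
 * afterwards, which this sketch omits.
 */
static int
replace_file(const char *path, const char *contents, bool durable)
{
	char		tmp[1024];
	FILE	   *f;

	assert(path != NULL && contents != NULL);
	if (snprintf(tmp, sizeof(tmp), "%s.tmp", path) >= (int) sizeof(tmp))
		return -1;

	f = fopen(tmp, "w");
	if (f == NULL)
		return -1;
	if (fputs(contents, f) == EOF || fflush(f) != 0 ||
		(durable && fsync(fileno(f)) != 0))
	{
		fclose(f);
		unlink(tmp);
		return -1;
	}
	if (fclose(f) != 0)
	{
		unlink(tmp);
		return -1;
	}
	return rename(tmp, path);
}
```

For a file that is regenerated at every start or reload, calling this with durable set to false loses nothing after a crash, because the file is rewritten anyway.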

Yeah, true. We definitely need to do something for that, even for HEAD
it seems like an overkill to have something in for example src/common
to allow frontends to have something if the fix is localized
(pg_rewind may use something else), and it would be nice to finish
wrapping that for the next minor release, so I propose the attached
patches. At the same time, I think that adminpack had better be fixed
as well, so there are actually three patches in this series, things
that I shaped thinking about a backpatch btw, particularly for 0002.

I'm doubtful about "fixing" adminpack. We don't know how it's used, and
this could *seriously* increase its overhead for something potentially
used at a high rate.

/*
+ * Wrapper of rename() similar to the backend version with the same function
+ * name aimed at making the renaming durable on disk. Note that this version
+ * does not fsync the old file before the rename as all the code paths leading
+ * to this function are already doing this operation. The new file is also
+ * normally not present on disk before the renaming so there is no need to
+ * bother about it.
+ */
+static int
+durable_rename(const char *oldfile, const char *newfile)
+{
+	int		fd;
+	char	parentpath[MAXPGPATH];
+
+	if (rename(oldfile, newfile) != 0)
+	{
+		/* durable_rename produced a log entry */
+		fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
+				progname, current_walfile_name, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming of the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	fd = open(newfile, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);

Why is S_IRUSR | S_IWUSR specified here?
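
For context on the question: in POSIX, the third (mode) argument of open() is only consulted when the call may create a file, that is, when O_CREAT is among the flags (Linux also uses it for O_TMPFILE). Opening an existing file just to fsync it therefore needs no mode. A minimal sketch, with fsync_existing_file as a hypothetical name:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: fsync an already existing file by name.  No mode
 * argument is passed to open() because O_CREAT is not among the flags.
 * Returns 0 on success, -1 on failure.
 */
static int
fsync_existing_file(const char *path)
{
	int			fd;
	int			rc;

	assert(path != NULL);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;
	rc = fsync(fd);
	close(fd);
	return rc;
}
```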

Are you working on a fix for pg_rewind? Let's go with initdb -S in a
first iteration, then we can, if somebody is interested enough, work on
making this nicer in master.

Greetings,

Andres Freund

#78David Steele
david@pgmasters.net
In reply to: Michael Paquier (#76)
Re: silent data loss with ext4 / all current versions

On 3/15/16 10:39 AM, Michael Paquier wrote:

On Thu, Mar 10, 2016 at 4:25 AM, Andres Freund wrote:

Note that we currently have some frontend programs with the equivalent
problem. Most importantly, receivelog.c (pg_basebackup/pg_receivexlog)
is missing pretty much the same directory fsyncs. And at least for
pg_receivexlog it's critical, especially now that pg_receivexlog supports
syncrep. I've not done anything about that; there's pretty much no
chance to share backend code with the frontend in the back-branches.

Yeah, true. We definitely need to do something for that, even for HEAD
it seems like an overkill to have something in for example src/common
to allow frontends to have something if the fix is localized
(pg_rewind may use something else), and it would be nice to finish
wrapping that for the next minor release, so I propose the attached
patches.

I noticed this when reviewing the pg_receivexlog refactor and was going
to submit a patch after the CF. It didn't occur to me to piggyback on
this work but I think it makes sense.

+1 from me for fixing this in pg_receivexlog and back-patching.

--
-David
david@pgmasters.net


#79Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#77)
3 attachment(s)
Re: silent data loss with ext4 / all current versions

On Wed, Mar 16, 2016 at 2:46 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-15 15:39:50 +0100, Michael Paquier wrote:

Yeah, true. We definitely need to do something for that, even for HEAD
it seems like overkill to have something in, for example, src/common
to allow frontends to have something if the fix is localized
(pg_rewind may use something else), and it would be nice to finish
wrapping that for the next minor release, so I propose the attached
patches. At the same time, I think that adminpack had better be fixed
as well, so there are actually three patches in this series, things
that I shaped thinking about a backpatch btw, particularly for 0002.

I'm doubtful about "fixing" adminpack. We don't know how it's used, and
this could *seriously* increase its overhead for something potentially
used at a high rate.

I think that Dave or Guillaume, added here in CC, could shed some light
on the matter. Let's see if that's a problem for them. I would tend to
think that it is not that critical, as I would imagine that this
function is not called at a high frequency.

/*
+ * Wrapper of rename() similar to the backend version with the same function
+ * name aimed at making the renaming durable on disk. Note that this version
+ * does not fsync the old file before the rename as all the code paths leading
+ * to this function are already doing this operation. The new file is also
+ * normally not present on disk before the renaming so there is no need to
+ * bother about it.
+ */
+static int
+durable_rename(const char *oldfile, const char *newfile)
+{
+     int             fd;
+     char    parentpath[MAXPGPATH];
+
+     if (rename(oldfile, newfile) != 0)
+     {
+             /* durable_rename produced a log entry */
+             fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
+                             progname, current_walfile_name, strerror(errno));
+             return -1;
+     }
+
+     /*
+      * To guarantee renaming of the file is persistent, fsync the file with its
+      * new name, and its containing directory.
+      */
+     fd = open(newfile, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);

Why is S_IRUSR | S_IWUSR specified here?

Oops. I have removed it and updated that as attached.

Are you working on a fix for pg_rewind? Let's go with initdb -S in a
first iteration, then we can, if somebody is interested enough, work on
making this nicer in master.

I am really -1 for this approach. Wrapping initdb -S with
find_other_exec is intrusive in back-branches knowing that all the I/O
write operations manipulating file descriptors go through file_ops.c,
and we actually just need to fsync the target file in
close_target_file(), making the fix a 7-line patch, and there is
no need to depend on another binary at all. The backup label file, as
well as the control file are using the same code path in file_ops.c...
And I like simple things :)

At the same time, I found a legit bug when the modified backup_label
file is created in createBackupLabel: the file is opened, written, but
not closed with close_target_file(), and it should be.
--
Michael

Attachments:

0001-Close-file-descriptor-associated-to-backup_label-cor.patch (binary/octet-stream)
From 785663decda969a202957802289784e311cf0b15 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 17 Mar 2016 22:30:42 +0900
Subject: [PATCH 1/3] Close file descriptor associated to backup_label
 correctly

The file descriptor used to generate the backup_label file was correctly
opened and written to, however it was never closed, causing a leak.
---
 src/bin/pg_rewind/pg_rewind.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 96a42f8..c5589b9 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -584,6 +584,7 @@ createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr checkpo
 	/* TODO: move old file out of the way, if any. */
 	open_target_file("backup_label", true);		/* BACKUP_LABEL_FILE */
 	write_target_range(buf, 0, len);
+	close_target_file();
 }
 
 /*
-- 
2.7.3

0002-fsync-properly-files-modified-by-pg_rewind.patch (binary/octet-stream)
From 08b05ac2dea84aa0de4d018065d69009186d70be Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 17 Mar 2016 22:46:56 +0900
Subject: [PATCH 2/3] fsync properly files modified by pg_rewind

Files updated by pg_rewind may have their changes lost in case of crashes
if those are not flushed correctly to disk, making a potential PGDATA
directory corrupted.
---
 src/bin/pg_rewind/file_ops.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 32eab3a..51cdf2b 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -74,12 +74,18 @@ close_target_file(void)
 	if (dstfd == -1)
 		return;
 
+	if (fsync(dstfd) != 0)
+	{
+		close(dstfd);
+		pg_fatal("could not fsync file \"%s\": %s\n",
+				 dstpath, strerror(errno));
+	}
+
 	if (close(dstfd) != 0)
 		pg_fatal("could not close target file \"%s\": %s\n",
 				 dstpath, strerror(errno));
 
 	dstfd = -1;
-	/* fsync? */
 }
 
 void
-- 
2.7.3

0003-Avoid-potential-data-loss-in-pg_receivexlog.patch (binary/octet-stream)
From 969774614d7759436672ecf7fe2b1e2ac0f85dfd Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 17 Mar 2016 23:01:47 +0900
Subject: [PATCH 3/3] Avoid potential data loss in pg_receivexlog

pg_receivexlog makes use of rename() for timeline history files as well
as for completed WAL segments. However, this is not reliable and may cause
the rename operation to be lost in case of crashes. This commit makes use
of a function similar to the backend's durable_rename to make the renaming operation
durable on disk.
---
 src/bin/pg_basebackup/receivelog.c | 93 +++++++++++++++++++++++++++++++++++---
 1 file changed, 87 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 595213f..cf9af83 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -50,6 +50,7 @@ static long CalculateCopyStreamSleeptime(int64 now, int standby_message_timeout,
 
 static bool ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos,
 						 uint32 *timeline);
+static int	durable_rename(const char *oldfile, const char *newfile);
 
 static bool
 mark_file_as_archived(const char *basedir, const char *fname)
@@ -217,10 +218,9 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", stream->basedir, current_walfile_name, stream->partial_suffix);
 		snprintf(newfn, sizeof(newfn), "%s/%s", stream->basedir, current_walfile_name);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, current_walfile_name, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -356,10 +356,9 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
-	if (rename(tmppath, path) < 0)
+	if (durable_rename(tmppath, path) < 0)
 	{
-		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
-				progname, tmppath, path, strerror(errno));
+		/* durable_rename produced a log entry */
 		return false;
 	}
 
@@ -786,6 +785,88 @@ ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos, uint32 *timeline)
 }
 
 /*
+ * Wrapper of rename() similar to the backend version with the same function
+ * name aimed at making the renaming durable on disk. Note that this version
+ * does not fsync the old file before the rename as all the code paths leading
+ * to this function are already doing this operation. The new file is also
+ * normally not present on disk before the renaming so there is no need to
+ * bother about it.
+ */
+static int
+durable_rename(const char *oldfile, const char *newfile)
+{
+	int		fd;
+	char	parentpath[MAXPGPATH];
+
+	if (rename(oldfile, newfile) != 0)
+	{
+		/* durable_rename produced a log entry */
+		fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
+				progname, current_walfile_name, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming of the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	fd = open(newfile, O_RDWR | PG_BINARY, 0);
+	if (fd < 0)
+	{
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, newfile, strerror(errno));
+		return -1;
+	}
+
+	if (fsync(fd) != 0)
+	{
+		close(fd);
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, newfile, strerror(errno));
+		return -1;
+	}
+	close(fd);
+
+	strlcpy(parentpath, newfile, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	fd = open(parentpath, O_RDONLY | PG_BINARY, 0);
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES), just ignore the error in that case.  If desired also silently
+	 * ignoring errors about unreadable files. Log others.
+	 */
+	if (fd < 0 && (errno == EISDIR || errno == EACCES))
+		return 0;
+	else if (fd < 0)
+	{
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, parentpath, strerror(errno));
+		return -1;
+	}
+
+	if (fsync(fd) != 0)
+	{
+		close(fd);
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, parentpath, strerror(errno));
+		return -1;
+	}
+	close(fd);
+
+	return 0;
+}
+
+/*
  * The main loop of ReceiveXlogStream. Handles the COPY stream after
  * initiating streaming with the START_STREAMING command.
  *
-- 
2.7.3

#80Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#79)
Re: silent data loss with ext4 / all current versions

On 2016-03-17 23:05:42 +0900, Michael Paquier wrote:

Are you working on a fix for pg_rewind? Let's go with initdb -S in a
first iteration, then we can, if somebody is interested enough, work on
making this nicer in master.

I am really -1 for this approach. Wrapping initdb -S with
find_other_exec is intrusive in back-branches knowing that all the I/O
write operations manipulating file descriptors go through file_ops.c,
and we actually just need to fsync the target file in
close_target_file(), making the fix a 7-line patch, and there is
no need to depend on another binary at all. The backup label file, as
well as the control file are using the same code path in file_ops.c...
And I like simple things :)

This is a *much* more expensive approach though. Doing the fsync
directly after modifying each file, one file at a time, will usually
result in each fsync blocking for a while.

In comparison, doing a flush and then an fsync pass over the whole
directory will usually block only rarely. The flushes for all files
can be combined into very few barrier operations.

Besides that, you're not syncing the directories, despite
open_target_file() potentially creating the directory.
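
To illustrate the two-pass idea described above, here is a minimal sketch
in C. It is not the initdb -S code: the function names (hint_writeback,
sync_path, sync_files_two_pass) are hypothetical, and using
posix_fadvise(POSIX_FADV_DONTNEED) as the writeback hint is an assumption
of this sketch (on Linux it typically starts writeback without waiting).
The first pass only nudges the kernel, the second pass does the blocking
fsync() calls, and the containing directory is synced last.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Pass 1 helper (hypothetical): ask the kernel to start writing the
 * file back, without waiting for completion. Failures are ignored
 * because this is only a hint. */
static void
hint_writeback(const char *path)
{
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
		return;
#ifdef POSIX_FADV_DONTNEED
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#endif
	close(fd);
}

/* Pass 2 helper (hypothetical): block until the file or directory has
 * been flushed. Directories are opened read-only, files read-write. */
static int
sync_path(const char *path, int isdir)
{
	int			fd;
	int			rc;

	assert(path != NULL);
	fd = open(path, isdir ? O_RDONLY : O_RDWR);
	if (fd < 0)
		return -1;
	rc = fsync(fd);
	close(fd);
	return rc;
}

/* Hint all files first, then fsync each one, then fsync the directory
 * so that newly created directory entries are durable as well. */
static int
sync_files_two_pass(const char *dir, const char *const *files, size_t nfiles)
{
	size_t		i;

	for (i = 0; i < nfiles; i++)
		hint_writeback(files[i]);

	for (i = 0; i < nfiles; i++)
	{
		if (sync_path(files[i], 0) != 0)
			return -1;
	}

	return sync_path(dir, 1);
}
```

By the time the second pass runs, much of the data may already be on its
way to disk, so the individual fsync() calls tend to return sooner than
when each file is synced right after being written.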

Greetings,

Andres Freund


#81Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#80)
3 attachment(s)
Re: silent data loss with ext4 / all current versions

On Fri, Mar 18, 2016 at 12:03 AM, Andres Freund <andres@anarazel.de> wrote:

This is a *much* more expensive approach though. Doing the fsync
directly after modifying each file, one file at a time, will usually
result in each fsync blocking for a while.

In comparison, doing a flush and then an fsync pass over the whole
directory will usually block only rarely. The flushes for all files
can be combined into very few barrier operations.

Hm... OK. I'd really like to keep the run of pg_rewind minimal as well
if possible. So here you go.
--
Michael

Attachments:

0001-Close-file-descriptor-associated-to-backup_label-cor.patch (binary/octet-stream)
From e030f9b15939d448eeef16e1d62aa6838ea94084 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 17 Mar 2016 22:30:42 +0900
Subject: [PATCH 1/3] Close file descriptor associated to backup_label
 correctly

The file descriptor used to generate the backup_label file was correctly
opened and written to, however it was never closed, causing a leak.
---
 src/bin/pg_rewind/pg_rewind.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 96a42f8..c5589b9 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -584,6 +584,7 @@ createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr checkpo
 	/* TODO: move old file out of the way, if any. */
 	open_target_file("backup_label", true);		/* BACKUP_LABEL_FILE */
 	write_target_range(buf, 0, len);
+	close_target_file();
 }
 
 /*
-- 
2.7.3

0002-fsync-properly-files-modified-by-pg_rewind.patch (binary/octet-stream)
From 4d1b1c999285f00273ee2afe322133a18ffd23d5 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 18 Mar 2016 14:53:53 +0900
Subject: [PATCH 2/3] fsync properly files modified by pg_rewind

Files updated by pg_rewind may have their changes lost in case of crashes
if those are not flushed correctly to disk, making a potential PGDATA
directory corrupted. pg_rewind invokes initdb -S for this purpose, flushing
all dirty files at once for performance purposes because a short execution
time matters with pg_rewind.
---
 src/bin/pg_rewind/file_ops.c  |  3 ++-
 src/bin/pg_rewind/pg_rewind.c | 54 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 32eab3a..e775685 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -79,7 +79,8 @@ close_target_file(void)
 				 dstpath, strerror(errno));
 
 	dstfd = -1;
-	/* fsync? */
+
+	/* fsync is done globally at the end of processing */
 }
 
 void
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index c5589b9..5377bd4 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -36,6 +36,7 @@ static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli,
 static void digestControlFile(ControlFileData *ControlFile, char *source,
 				  size_t size);
 static void updateControlFile(ControlFileData *ControlFile);
+static void syncTargetDirectory(const char *argv0);
 static void sanityChecks(void);
 static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
 
@@ -349,6 +350,9 @@ main(int argc, char **argv)
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
 	updateControlFile(&ControlFile_new);
 
+	pg_log(PG_PROGRESS, "syncing target data directory via initdb -S\n");
+	syncTargetDirectory(argv[0]);
+
 	printf(_("Done!\n"));
 
 	return 0;
@@ -650,3 +654,53 @@ updateControlFile(ControlFileData *ControlFile)
 
 	close_target_file();
 }
+
+/*
+ * Sync data directory to ensure that what has been generated up to now is
+ * persistent in case of a crash, and this is done once globally for
+ * performance reasons as sync requests on individual files would be
+ * a negative impact on the running time of pg_rewind. This is invoked at
+ * the end of processing once everything has been processed and written.
+ */
+static void
+syncTargetDirectory(const char *argv0)
+{
+	int		ret;
+	char	exec_path[MAXPGPATH];
+	char	cmd[MAXPGPATH];
+
+	if (dry_run)
+		return;
+
+	/* Grab and invoke initdb to perform the sync */
+	if ((ret = find_other_exec(argv0, "initdb",
+							   "initdb (PostgreSQL) " PG_VERSION "\n",
+							   exec_path)) < 0)
+	{
+		char        full_path[MAXPGPATH];
+
+		if (find_my_exec(argv0, full_path) < 0)
+			strlcpy(full_path, progname, sizeof(full_path));
+
+		if (ret == -1)
+			pg_fatal("The program \"initdb\" is needed by %s but was \n"
+					 "not found in the same directory as \"%s\".\n"
+					 "Check your installation.\n", progname, full_path);
+		else
+			pg_fatal("The program \"postgres\" was found by \"%s\" but was \n"
+					 "not the same version as %s.\n"
+					 "Check your installation.\n", progname, full_path);
+	}
+
+	/* now run initdb */
+	if (debug)
+		snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S",
+				 exec_path, datadir_target);
+	else
+		snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S > \"%s\"",
+				 exec_path, datadir_target, DEVNULL);
+
+	if (system(cmd) != 0)
+		pg_fatal("sync of target directory with initdb -S failed\n");
+	pg_log(PG_PROGRESS, "sync of target directory with initdb -S done\n");
+}
-- 
2.7.3

0003-Avoid-potential-data-loss-in-pg_receivexlog.patch (binary/octet-stream)
From f07338c6d275866a701038fe057d7db5c1992db9 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 17 Mar 2016 23:01:47 +0900
Subject: [PATCH 3/3] Avoid potential data loss in pg_receivexlog

pg_receivexlog makes use of rename() for timeline history files as well
as for completed WAL segments. However, this is not reliable and may cause
the rename operation to be lost in case of crashes. This commit makes use
of a function similar to the backend's durable_rename to make the renaming operation
durable on disk.
---
 src/bin/pg_basebackup/receivelog.c | 93 +++++++++++++++++++++++++++++++++++---
 1 file changed, 87 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 595213f..cf9af83 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -50,6 +50,7 @@ static long CalculateCopyStreamSleeptime(int64 now, int standby_message_timeout,
 
 static bool ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos,
 						 uint32 *timeline);
+static int	durable_rename(const char *oldfile, const char *newfile);
 
 static bool
 mark_file_as_archived(const char *basedir, const char *fname)
@@ -217,10 +218,9 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", stream->basedir, current_walfile_name, stream->partial_suffix);
 		snprintf(newfn, sizeof(newfn), "%s/%s", stream->basedir, current_walfile_name);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, current_walfile_name, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -356,10 +356,9 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
-	if (rename(tmppath, path) < 0)
+	if (durable_rename(tmppath, path) < 0)
 	{
-		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
-				progname, tmppath, path, strerror(errno));
+		/* durable_rename produced a log entry */
 		return false;
 	}
 
@@ -786,6 +785,88 @@ ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos, uint32 *timeline)
 }
 
 /*
+ * Wrapper of rename() similar to the backend version with the same function
+ * name aimed at making the renaming durable on disk. Note that this version
+ * does not fsync the old file before the rename as all the code paths leading
+ * to this function are already doing this operation. The new file is also
+ * normally not present on disk before the renaming so there is no need to
+ * bother about it.
+ */
+static int
+durable_rename(const char *oldfile, const char *newfile)
+{
+	int		fd;
+	char	parentpath[MAXPGPATH];
+
+	if (rename(oldfile, newfile) != 0)
+	{
+		/* durable_rename produced a log entry */
+		fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
+				progname, current_walfile_name, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming of the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	fd = open(newfile, O_RDWR | PG_BINARY, 0);
+	if (fd < 0)
+	{
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, newfile, strerror(errno));
+		return -1;
+	}
+
+	if (fsync(fd) != 0)
+	{
+		close(fd);
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, newfile, strerror(errno));
+		return -1;
+	}
+	close(fd);
+
+	strlcpy(parentpath, newfile, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	fd = open(parentpath, O_RDONLY | PG_BINARY, 0);
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES), just ignore the error in that case.  If desired also silently
+	 * ignoring errors about unreadable files. Log others.
+	 */
+	if (fd < 0 && (errno == EISDIR || errno == EACCES))
+		return 0;
+	else if (fd < 0)
+	{
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, parentpath, strerror(errno));
+		return -1;
+	}
+
+	if (fsync(fd) != 0)
+	{
+		close(fd);
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, parentpath, strerror(errno));
+		return -1;
+	}
+	close(fd);
+
+	return 0;
+}
+
+/*
  * The main loop of ReceiveXlogStream. Handles the COPY stream after
  * initiating streaming with the START_STREAMING command.
  *
-- 
2.7.3

#82Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#80)
Re: silent data loss with ext4 / all current versions

On Thu, Mar 17, 2016 at 11:03 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-17 23:05:42 +0900, Michael Paquier wrote:

Are you working on a fix for pg_rewind? Let's go with initdb -S in a
first iteration, then we can, if somebody is interested enough, work on
making this nicer in master.

I am really -1 for this approach. Wrapping initdb -S with
find_other_exec is intrusive in back-branches knowing that all the I/O
write operations manipulating file descriptors go through file_ops.c,
and we actually just need to fsync the target file in
close_target_file(), making the fix a 7-line patch, and there is
no need to depend on another binary at all. The backup label file, as
well as the control file are using the same code path in file_ops.c...
And I like simple things :)

This is a *much* more expensive approach though. Doing the fsync
directly after modifying each file, one file at a time, will usually
result in each fsync blocking for a while.

In comparison, doing a flush and then an fsync pass over the whole
directory will usually block only rarely. The flushes for all files
can be combined into very few barrier operations.

Yeah, I'm pretty sure this was discussed when initdb -S went in. I
think reusing that is a good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#83Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#81)
1 attachment(s)
Re: silent data loss with ext4 / all current versions

Hi,

On 2016-03-18 15:08:32 +0900, Michael Paquier wrote:

+/*
+ * Sync data directory to ensure that what has been generated up to now is
+ * persistent in case of a crash, and this is done once globally for
+ * performance reasons as sync requests on individual files would be
+ * a negative impact on the running time of pg_rewind. This is invoked at
+ * the end of processing once everything has been processed and written.
+ */
+static void
+syncTargetDirectory(const char *argv0)
+{
+	int		ret;
+	char	exec_path[MAXPGPATH];
+	char	cmd[MAXPGPATH];
+
+	if (dry_run)
+		return;

I think it makes more sense to return after detecting the binary, so
you'd find out about problems around not finding initdb during the dry
run, not later.

+	/* Grab and invoke initdb to perform the sync */
+	if ((ret = find_other_exec(argv0, "initdb",
+							   "initdb (PostgreSQL) " PG_VERSION "\n",
+							   exec_path)) < 0)
+	{
+		char        full_path[MAXPGPATH];
+
+		if (find_my_exec(argv0, full_path) < 0)
+			strlcpy(full_path, progname, sizeof(full_path));
+
+		if (ret == -1)
+			pg_fatal("The program \"initdb\" is needed by %s but was \n"
+					 "not found in the same directory as \"%s\".\n"
+					 "Check your installation.\n", progname, full_path);
+		else
+			pg_fatal("The program \"postgres\" was found by \"%s\" but was \n"
+					 "not the same version as %s.\n"
+					 "Check your installation.\n", progname, full_path);

Wrong binary name.

+	}
+
+	/* now run initdb */
+	if (debug)
+		snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S",
+				 exec_path, datadir_target);
+	else
+		snprintf(cmd, MAXPGPATH, "\"%s\" -D \"%s\" -S > \"%s\"",
+				 exec_path, datadir_target, DEVNULL);
+
+	if (system(cmd) != 0)
+		pg_fatal("sync of target directory with initdb -S failed\n");
+	pg_log(PG_PROGRESS, "sync of target directory with initdb -S done\n");
+}

Don't see need for emitting "done", for now at least.

/*
+ * Wrapper of rename() similar to the backend version with the same function
+ * name aimed at making the renaming durable on disk. Note that this version
+ * does not fsync the old file before the rename as all the code paths leading
+ * to this function are already doing this operation. The new file is also
+ * normally not present on disk before the renaming so there is no need to
+ * bother about it.

I don't think it's a good idea to skip fsyncing the old file based on
that; it's way too likely that that'll not be done for the next user of
durable_rename.

+ */
+static int
+durable_rename(const char *oldfile, const char *newfile)
+{
+	int		fd;
+	char	parentpath[MAXPGPATH];
+
+	if (rename(oldfile, newfile) != 0)
+	{
+		/* durable_rename produced a log entry */

Uh?

+		fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
+				progname, current_walfile_name, strerror(errno));

current_walfile_name doesn't look right, that's a largely independent
global variable.

+	if (fsync(fd) != 0)
+	{
+		close(fd);
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, newfile, strerror(errno));
+		return -1;

Close should be after the strerror() call (yes, that mistake already
existed once in receivelog.c).
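
The ordering issue can be sketched as follows: close() may itself set
errno, so the message should be built (or errno saved) before the
descriptor is closed. This is a minimal illustration, not the
receivelog.c code; the helper name fsync_then_close and the message
format are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: fsync a descriptor, then close it. On fsync
 * failure the error is reported *before* close() runs, and errno is
 * restored afterwards so the caller still sees the fsync error. */
static int
fsync_then_close(int fd, const char *label)
{
	assert(label != NULL);

	if (fsync(fd) != 0)
	{
		int			save_errno = errno;

		/* report first: close() below may overwrite errno */
		fprintf(stderr, "could not fsync file \"%s\": %s\n",
				label, strerror(save_errno));
		close(fd);
		errno = save_errno;
		return -1;
	}

	if (close(fd) != 0)
	{
		fprintf(stderr, "could not close file \"%s\": %s\n",
				label, strerror(errno));
		return -1;
	}

	return 0;
}
```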

+	fd = open(parentpath, O_RDONLY | PG_BINARY, 0);
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES), just ignore the error in that case.  If desired also silently
+	 * ignoring errors about unreadable files. Log others.
+	 */

Comment is not applicable as a whole.

+	if (fd < 0 && (errno == EISDIR || errno == EACCES))
+		return 0;
+	else if (fd < 0)
+	{
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, parentpath, strerror(errno));
+		return -1;
+	}
+
+	if (fsync(fd) != 0)
+	{
+		close(fd);
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, parentpath, strerror(errno));
+		return -1;

close() needs to be moved again.

It'd be easier to apply these if the rate of trivial issues were lower
:(.

Attached is a heavily revised version of the patch. Besides the above,
I've:
* copied fsync_fname_ext from initdb; I think the code is easier to
  understand this way, and it'll make unifying the code easier
* added fsyncing of the directory in mark_file_as_archived()
* The WAL files also need to be fsynced when created in open_walfile():
  - otherwise the directory entry itself isn't safely persistent, as we
    don't fsync the directory in the stream->synchronous fsync() cases.
  - we refuse to resume in open_walfile(), if a file isn't 16MB when
    restarting. Without an fsync that's actually not unlikely after a
    crash. Even with an fsync that's not guaranteed not to happen, but
    the chance of it is much lower.
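
The overall sequence discussed in this review (fsync the old file,
rename, fsync the file under its new name, fsync the containing
directory) can be sketched as below. This is a simplified illustration
under stated assumptions, not the attached patch: the names sync_path,
sync_parent_dir and durable_rename_sketch are hypothetical, POSIX
dirname() stands in for get_parent_directory(), both paths are assumed
to live in the same directory, and error reporting is omitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: fsync a file (opened read-write) or a directory
 * (opened read-only). Returns 0 on success, -1 on failure. */
static int
sync_path(const char *path, int isdir)
{
	int			fd;
	int			rc;

	assert(path != NULL);
	fd = open(path, isdir ? O_RDONLY : O_RDWR);
	if (fd < 0)
		return -1;
	rc = fsync(fd);
	close(fd);
	return rc;
}

/* Hypothetical helper: fsync the directory containing "path".
 * dirname() may modify its argument, so work on a copy; it yields "."
 * for a bare file name, which covers the current-directory case. */
static int
sync_parent_dir(const char *path)
{
	char		buf[1024];

	if (strlen(path) >= sizeof(buf))
		return -1;
	strcpy(buf, path);
	return sync_path(dirname(buf), 1);
}

/* Sketch of a durable rename within one directory. */
static int
durable_rename_sketch(const char *oldpath, const char *newpath)
{
	/* make sure the file contents are on disk before renaming */
	if (sync_path(oldpath, 0) != 0)
		return -1;

	if (rename(oldpath, newpath) != 0)
		return -1;

	/* fsync the file under its new name, then the directory entry */
	if (sync_path(newpath, 0) != 0)
		return -1;

	return sync_parent_dir(newpath);
}
```

A rename across directories would also need the old parent directory
synced, which this sketch deliberately leaves out.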

I'm too tired to push this at the eleventh hour. Besides a heavily
revised patch, backpatching will likely include a number of conflicts.
If somebody in the US has the energy to take care of this...

I've also noticed that

a) pg_basebackup doesn't do anything about durability (it probably needs
a very similar patch to the one pg_rewind just received).
b) nor does pg_dump[all]

I think it's pretty unacceptable for backup tools to be so cavalier
about durability.

So we're going to have another round of fsync stuff in the next set of
releases anyway...

Greetings,

Andres Freund

Attachments:

0001-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patch (text/x-patch; charset=us-ascii)
From 7c909e9913ea9acac6d0fc6ac8a40e62584568a3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Mar 2016 01:24:42 +0200
Subject: [PATCH] Issue fsync more carefully in pg_receivexlog and
 pg_basebackup -X stream.

Several places weren't careful about fsyncing in the way. See 1d4a0ab1
and 606e0f98 for details about required fsyncs.

This introduces a near-copy of initdb's fsync_fname_ext(), and of the
backend's durable_rename(), fsync_parent_path(). At least the frontend
duplication should be avoided; but that'd end up being hard to
backpatch.

Author: Michael Paquier, heavily revised by me
Discussion: CAB7nPqRmM+CX6bVxw0Y7mMVGMFj1S8kwhevt8TaP83yeFRfbXA@mail.gmail.com
Backpatch: 9.1 (in parts)
---
 src/bin/pg_basebackup/receivelog.c | 188 ++++++++++++++++++++++++++++++++-----
 1 file changed, 164 insertions(+), 24 deletions(-)

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 595213f..c533ad1 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -50,6 +50,9 @@ static long CalculateCopyStreamSleeptime(int64 now, int standby_message_timeout,
 
 static bool ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos,
 						 uint32 *timeline);
+static int	fsync_parent_path(const char *fname);
+static int	fsync_fname_ext(const char *fname, bool isdir);
+static int	durable_rename(const char *oldfile, const char *newfile);
 
 static bool
 mark_file_as_archived(const char *basedir, const char *fname)
@@ -68,18 +71,14 @@ mark_file_as_archived(const char *basedir, const char *fname)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-
-		close(fd);
-
-		return false;
-	}
-
 	close(fd);
 
+	if (fsync_fname_ext(tmppath, false) != 0)
+		return false;
+
+	if (fsync_parent_path(tmppath) != 0)
+		return false;
+
 	return true;
 }
 
@@ -116,6 +115,10 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	/*
 	 * Verify that the file is either empty (just created), or a complete
 	 * XLogSegSize segment. Anything in between indicates a corrupt file.
+	 *
+	 * XXX: This means that we might not restart if a crash occurs before the
+	 * fsync below. We probably should create the file in a temporary path
+	 * like the backend does...
 	 */
 	if (fstat(f, &statbuf) != 0)
 	{
@@ -129,6 +132,16 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	{
 		/* File is open and ready to use */
 		walfile = f;
+
+		/*
+		 * fsync, in case of a previous crash between padding and fsyncing the
+		 * file.
+		 */
+		if (fsync_fname_ext(fn, false) != 0)
+			return false;
+		if (fsync_parent_path(fn) != 0)
+			return false;
+
 		return true;
 	}
 	if (statbuf.st_size != 0)
@@ -157,6 +170,17 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	}
 	free(zerobuf);
 
+	/*
+	 * fsync WAL file and containing directory, to ensure the file is
+	 * persistently created and zeroed. That's particularly important when
+	 * using synchronous mode, where the file is modified and fsynced
+	 * in-place, without a directory fsync.
+	 */
+	if (fsync_fname_ext(fn, false) != 0)
+		return false;
+	if (fsync_parent_path(fn) != 0)
+		return false;
+
 	if (lseek(f, SEEK_SET, 0) != 0)
 	{
 		fprintf(stderr,
@@ -217,10 +241,9 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", stream->basedir, current_walfile_name, stream->partial_suffix);
 		snprintf(newfn, sizeof(newfn), "%s/%s", stream->basedir, current_walfile_name);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, current_walfile_name, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -338,14 +361,6 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		close(fd);
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-		return false;
-	}
-
 	if (close(fd) != 0)
 	{
 		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
@@ -356,10 +371,9 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
-	if (rename(tmppath, path) < 0)
+	if (durable_rename(tmppath, path) < 0)
 	{
-		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
-				progname, tmppath, path, strerror(errno));
+		/* durable_rename produced a log entry */
 		return false;
 	}
 
@@ -786,6 +800,132 @@ ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos, uint32 *timeline)
 }
 
 /*
+ * fsync_fname_ext -- Try to fsync a file or directory
+ *
+ * Returns 0 if the operation succeeded, -1 otherwise.
+ *
+ * XXX: This is a near-duplicate of initdb.c's fsync_fname_ext(); they should
+ * be unified into a common place.
+ */
+static int
+fsync_fname_ext(const char *fname, bool isdir)
+{
+	int			fd;
+	int			flags;
+	int			returncode;
+
+	/*
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
+	 * not writable by our userid, but we assume that's OK.
+	 */
+	flags = PG_BINARY;
+	if (!isdir)
+		flags |= O_RDWR;
+	else
+		flags |= O_RDONLY;
+
+	/*
+	 * Open the file, silently ignoring errors about unreadable files (or
+	 * unsupported operations, e.g. opening a directory under Windows), and
+	 * logging others.
+	 */
+	fd = open(fname, flags);
+	if (fd < 0)
+	{
+		if (isdir && (errno == EISDIR || errno == EACCES))
+			return 0;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return -1;
+	}
+
+	returncode = fsync(fd);
+
+	/*
+	 * Some OSes don't allow us to fsync directories at all, so we can ignore
+	 * those errors. Anything else needs to be reported.
+	 */
+	if (returncode != 0 && !(isdir && errno == EBADF))
+	{
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		close(fd);
+		return -1;
+	}
+
+	close(fd);
+	return 0;
+}
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+static int
+fsync_parent_path(const char *fname)
+{
+	char		parentpath[MAXPGPATH];
+
+	strlcpy(parentpath, fname, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	if (fsync_fname_ext(parentpath, true) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability
+ *
+ * Wrapper around rename, similar to the backend version.  Note that this
+ * version does not fsync the target file before the rename, as it's unlikely
+ * to be helpful for current and prospective users.
+ */
+static int
+durable_rename(const char *oldfile, const char *newfile)
+{
+	/*
+	 * First fsync the old path, to ensure that it is properly persistent on
+	 * disk.
+	 */
+	if (fsync_fname_ext(oldfile, false) != 0)
+		return -1;
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+	{
+		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
+				progname, oldfile, newfile, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	if (fsync_fname_ext(newfile, false) != 0)
+		return -1;
+
+	if (fsync_parent_path(newfile) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
  * The main loop of ReceiveXlogStream. Handles the COPY stream after
  * initiating streaming with the START_STREAMING command.
  *
-- 
2.7.0.229.g701fa7f.dirty

#84Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#83)
3 attachment(s)
Re: silent data loss with ext4 / all current versions

On Mon, Mar 28, 2016 at 8:25 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-18 15:08:32 +0900, Michael Paquier wrote:

+             fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
+                             progname, current_walfile_name, strerror(errno));

current_walfile_name doesn't look right, that's a largely independent
global variable.

Stupid mistake.

+     if (fsync(fd) != 0)
+     {
+             close(fd);
+             fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+                             progname, newfile, strerror(errno));
+             return -1;

Close should be after the strerror() call (yes, that mistake already
existed once in receivelog.c).

Right.

It'd be easier to apply these if the rate of trivial issues were lower
:(.

Sorry about that. I'll be more careful.

Attached is a heavily revised version of the patch. Besides the above,
I've:
* copied fsync_fname_ext from initdb; I think the code is easier to
understand this way, and it'll make unifying the code easier

OK.

* added fsyncing of the directory in mark_file_as_archived()
* The WAL files also need to be fsynced when created in open_walfile():
  - otherwise the directory entry itself isn't safely persistent, as we
    don't fsync the directory in the stream->synchronous fsync() cases.
  - we refuse to resume in open_walfile(), if a file isn't 16MB when
    restarting. Without an fsync that's actually not unlikely after a
    crash. Even with an fsync that's not guaranteed not to happen, but
    the chance of it is much lower.
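
The open_walfile() point above (pad the new file, fsync it, then fsync the containing directory) can be sketched outside of PostgreSQL roughly as follows. This is only an illustration using plain POSIX calls: create_padded_file() and sync_dir() are hypothetical names, not functions from the patch, and short writes or EINTR are simply treated as failures.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fsync a directory so the entries in it are durable. */
static int
sync_dir(const char *dir)
{
    int         fd = open(dir, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/*
 * Hypothetical helper: create "dir/name", pad it with zeroes up to "size"
 * bytes, and make both the contents and the directory entry durable.
 * Returns 0 on success, -1 on failure.
 */
static int
create_padded_file(const char *dir, const char *name, size_t size)
{
    char        path[1024];
    char        zerobuf[8192];
    size_t      written = 0;
    int         n;
    int         fd;

    n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t) n >= sizeof(path))
        return -1;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    memset(zerobuf, 0, sizeof(zerobuf));
    while (written < size)
    {
        size_t      chunk = size - written;
        ssize_t     rc;

        if (chunk > sizeof(zerobuf))
            chunk = sizeof(zerobuf);
        rc = write(fd, zerobuf, chunk);
        if (rc <= 0)
        {
            close(fd);
            return -1;
        }
        written += (size_t) rc;
    }

    /* Step 1: flush the zeroed contents and the file size. */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* Step 2: flush the directory entry; fsync on the file does not cover it. */
    return sync_dir(dir);
}
```

Without step 1, a crash can leave a file shorter than the expected segment size, which is the case open_walfile() refuses to resume from; without step 2, the file itself may be missing after a crash.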
I'm too tired to push this at the eleventh hour. Besides a heavily
revised patch, backpatching will likely include a number of conflicts.
If somebody in the US has the energy to take care of this...

Close enough to the US. Attached are backpatchable versions based on
the corrected version you sent. 9.3 and 9.4 share the same patch; 9.2
needed more work, but nothing huge.

So we're going to have another round of fsync stuff in the next set of
releases anyway...

Yes, seeing how 9.5.2 is close by, I think that it would be wiser to
push this stuff after the upcoming minor release.
--
Michael

Attachments:

pg_receivexlog-sync-94-93.patch (invalid/octet-stream; name=pg_receivexlog-sync-94-93.patch)
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index e1bd4ad..d8145e9 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -40,6 +40,9 @@ static PGresult *HandleCopyStream(PGconn *conn, XLogRecPtr startpos,
 
 static bool ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos,
 						 uint32 *timeline);
+static int	fsync_parent_path(const char *fname);
+static int	fsync_fname_ext(const char *fname, bool isdir);
+static int	durable_rename(const char *oldfile, const char *newfile);
 
 static bool
 mark_file_as_archived(const char *basedir, const char *fname)
@@ -58,17 +61,13 @@ mark_file_as_archived(const char *basedir, const char *fname)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-
-		close(fd);
+	close(fd);
 
+	if (fsync_fname_ext(tmppath, false) != 0)
 		return false;
-	}
 
-	close(fd);
+	if (fsync_parent_path(tmppath) != 0)
+		return false;
 
 	return true;
 }
@@ -107,6 +106,10 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
 	/*
 	 * Verify that the file is either empty (just created), or a complete
 	 * XLogSegSize segment. Anything in between indicates a corrupt file.
+	 *
+	 * XXX: This means that we might not restart if a crash occurs before the
+	 * fsync below. We probably should create the file in a temporary path
+	 * like the backend does...
 	 */
 	if (fstat(f, &statbuf) != 0)
 	{
@@ -120,6 +123,16 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
 	{
 		/* File is open and ready to use */
 		walfile = f;
+
+		/*
+		 * fsync, in case of a previous crash between padding and fsyncing the
+		 * file.
+		 */
+		if (fsync_fname_ext(fn, false) != 0)
+			return false;
+		if (fsync_parent_path(fn) != 0)
+			return false;
+
 		return true;
 	}
 	if (statbuf.st_size != 0)
@@ -148,6 +161,15 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
 	}
 	free(zerobuf);
 
+	/*
+	 * fsync WAL file and containing directory, to ensure the file is
+	 * persistently created and zeroed.
+	 */
+	if (fsync_fname_ext(fn, false) != 0)
+		return false;
+	if (fsync_parent_path(fn) != 0)
+		return false;
+
 	if (lseek(f, SEEK_SET, 0) != 0)
 	{
 		fprintf(stderr,
@@ -208,10 +230,9 @@ close_walfile(char *basedir, char *partial_suffix, bool mark_done)
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", basedir, current_walfile_name, partial_suffix);
 		snprintf(newfn, sizeof(newfn), "%s/%s", basedir, current_walfile_name);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, current_walfile_name, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -386,14 +407,6 @@ writeTimeLineHistoryFile(char *basedir, TimeLineID tli, char *filename,
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		close(fd);
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-		return false;
-	}
-
 	if (close(fd) != 0)
 	{
 		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
@@ -404,10 +417,9 @@ writeTimeLineHistoryFile(char *basedir, TimeLineID tli, char *filename,
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
-	if (rename(tmppath, path) < 0)
+	if (durable_rename(tmppath, path) < 0)
 	{
-		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
-				progname, tmppath, path, strerror(errno));
+		/* durable_rename produced a log entry */
 		return false;
 	}
 
@@ -833,6 +845,132 @@ ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos, uint32 *timeline)
 }
 
 /*
+ * fsync_fname_ext -- Try to fsync a file or directory
+ *
+ * Returns 0 if the operation succeeded, -1 otherwise.
+ *
+ * XXX: This is a near-duplicate of initdb.c's fsync_fname_ext(); they should
+ * be unified into a common place.
+ */
+static int
+fsync_fname_ext(const char *fname, bool isdir)
+{
+	int			fd;
+	int			flags;
+	int			returncode;
+
+	/*
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
+	 * not writable by our userid, but we assume that's OK.
+	 */
+	flags = PG_BINARY;
+	if (!isdir)
+		flags |= O_RDWR;
+	else
+		flags |= O_RDONLY;
+
+	/*
+	 * Open the file, silently ignoring errors about unreadable files (or
+	 * unsupported operations, e.g. opening a directory under Windows), and
+	 * logging others.
+	 */
+	fd = open(fname, flags);
+	if (fd < 0)
+	{
+		if (isdir && (errno == EISDIR || errno == EACCES))
+			return 0;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return -1;
+	}
+
+	returncode = fsync(fd);
+
+	/*
+	 * Some OSes don't allow us to fsync directories at all, so we can ignore
+	 * those errors. Anything else needs to be reported.
+	 */
+	if (returncode != 0 && !(isdir && errno == EBADF))
+	{
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		close(fd);
+		return -1;
+	}
+
+	close(fd);
+	return 0;
+}
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+static int
+fsync_parent_path(const char *fname)
+{
+	char		parentpath[MAXPGPATH];
+
+	strlcpy(parentpath, fname, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	if (fsync_fname_ext(parentpath, true) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability
+ *
+ * Wrapper around rename, similar to the backend version.  Note that this
+ * version does not fsync the target file before the rename, as it's unlikely
+ * to be helpful for current and prospective users.
+ */
+static int
+durable_rename(const char *oldfile, const char *newfile)
+{
+	/*
+	 * First fsync the old path, to ensure that it is properly persistent on
+	 * disk.
+	 */
+	if (fsync_fname_ext(oldfile, false) != 0)
+		return -1;
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+	{
+		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
+				progname, oldfile, newfile, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	if (fsync_fname_ext(newfile, false) != 0)
+		return -1;
+
+	if (fsync_parent_path(newfile) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
  * The main loop of ReceiveXLogStream. Handles the COPY stream after
  * initiating streaming with the START_STREAMING command.
  *
pg_receivexlog-sync-92.patch (invalid/octet-stream; name=pg_receivexlog-sync-92.patch)
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index b7f43d5..20b6227 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -43,6 +43,9 @@ const XLogRecPtr InvalidXLogRecPtr = {0, 0};
 /* fd for currently open WAL file */
 static int	walfile = -1;
 
+static int	fsync_parent_path(const char *fname);
+static int	fsync_fname_ext(const char *fname, bool isdir);
+static int	durable_rename(const char *oldfile, const char *newfile);
 
 static bool
 mark_file_as_archived(const char *basedir, const char *fname)
@@ -61,17 +64,13 @@ mark_file_as_archived(const char *basedir, const char *fname)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-
-		close(fd);
+	close(fd);
 
+	if (fsync_fname_ext(tmppath, false) != 0)
 		return false;
-	}
 
-	close(fd);
+	if (fsync_parent_path(tmppath) != 0)
+		return false;
 
 	return true;
 }
@@ -109,6 +108,10 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
 	/*
 	 * Verify that the file is either empty (just created), or a complete
 	 * XLogSegSize segment. Anything in between indicates a corrupt file.
+	 *
+	 * XXX: This means that we might not restart if a crash occurs before the
+	 * fsync below. We probably should create the file in a temporary path
+	 * like the backend does...
 	 */
 	if (fstat(f, &statbuf) != 0)
 	{
@@ -119,7 +122,19 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
 		return -1;
 	}
 	if (statbuf.st_size == XLogSegSize)
-		return f;				/* File is open and ready to use */
+	{
+		/*
+		 * fsync, in case of a previous crash between padding and fsyncing the
+		 * file.
+		 */
+		if (fsync_fname_ext(fn, false) != 0)
+			return -1;
+		if (fsync_parent_path(fn) != 0)
+			return -1;
+
+		/* File is open and ready to use */
+		return f;
+	}
 	if (statbuf.st_size != 0)
 	{
 		fprintf(stderr,
@@ -146,6 +161,15 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
 	}
 	free(zerobuf);
 
+	/*
+	 * fsync WAL file and containing directory, to ensure the file is
+	 * persistently created and zeroed.
+	 */
+	if (fsync_fname_ext(fn, false) != 0)
+		return -1;
+	if (fsync_parent_path(fn) != 0)
+		return -1;
+
 	if (lseek(f, SEEK_SET, 0) != 0)
 	{
 		fprintf(stderr,
@@ -205,10 +229,9 @@ close_walfile(char *basedir, char *walname, bool segment_complete,
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s.partial", basedir, walname);
 		snprintf(newfn, sizeof(newfn), "%s/%s", basedir, walname);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, walname, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -304,6 +327,132 @@ localTimestampDifferenceExceeds(TimestampTz start_time,
 }
 
 /*
+ * fsync_fname_ext -- Try to fsync a file or directory
+ *
+ * Returns 0 if the operation succeeded, -1 otherwise.
+ *
+ * XXX: This is a near-duplicate of initdb.c's fsync_fname_ext(); they should
+ * be unified into a common place.
+ */
+static int
+fsync_fname_ext(const char *fname, bool isdir)
+{
+	int			fd;
+	int			flags;
+	int			returncode;
+
+	/*
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
+	 * not writable by our userid, but we assume that's OK.
+	 */
+	flags = PG_BINARY;
+	if (!isdir)
+		flags |= O_RDWR;
+	else
+		flags |= O_RDONLY;
+
+	/*
+	 * Open the file, silently ignoring errors about unreadable files (or
+	 * unsupported operations, e.g. opening a directory under Windows), and
+	 * logging others.
+	 */
+	fd = open(fname, flags);
+	if (fd < 0)
+	{
+		if (isdir && (errno == EISDIR || errno == EACCES))
+			return 0;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return -1;
+	}
+
+	returncode = fsync(fd);
+
+	/*
+	 * Some OSes don't allow us to fsync directories at all, so we can ignore
+	 * those errors. Anything else needs to be reported.
+	 */
+	if (returncode != 0 && !(isdir && errno == EBADF))
+	{
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		close(fd);
+		return -1;
+	}
+
+	close(fd);
+	return 0;
+}
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+static int
+fsync_parent_path(const char *fname)
+{
+	char		parentpath[MAXPGPATH];
+
+	strlcpy(parentpath, fname, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	if (fsync_fname_ext(parentpath, true) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability
+ *
+ * Wrapper around rename, similar to the backend version.  Note that this
+ * version does not fsync the target file before the rename, as it's unlikely
+ * to be helpful for current and prospective users.
+ */
+static int
+durable_rename(const char *oldfile, const char *newfile)
+{
+	/*
+	 * First fsync the old path, to ensure that it is properly persistent on
+	 * disk.
+	 */
+	if (fsync_fname_ext(oldfile, false) != 0)
+		return -1;
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+	{
+		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
+				progname, oldfile, newfile, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	if (fsync_fname_ext(newfile, false) != 0)
+		return -1;
+
+	if (fsync_parent_path(newfile) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
  * Receive a log stream starting at the specified position.
  *
  * If sysidentifier is specified, validate that both the system
pg_receivexlog-sync-95.patch (invalid/octet-stream; name=pg_receivexlog-sync-95.patch)
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 3c60626..a206ba6 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -59,6 +59,9 @@ static long CalculateCopyStreamSleeptime(int64 now, int standby_message_timeout,
 
 static bool ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos,
 						 uint32 *timeline);
+static int	fsync_parent_path(const char *fname);
+static int	fsync_fname_ext(const char *fname, bool isdir);
+static int	durable_rename(const char *oldfile, const char *newfile);
 
 static bool
 mark_file_as_archived(const char *basedir, const char *fname)
@@ -77,17 +80,13 @@ mark_file_as_archived(const char *basedir, const char *fname)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-
-		close(fd);
+	close(fd);
 
+	if (fsync_fname_ext(tmppath, false) != 0)
 		return false;
-	}
 
-	close(fd);
+	if (fsync_parent_path(tmppath) != 0)
+		return false;
 
 	return true;
 }
@@ -126,6 +125,10 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
 	/*
 	 * Verify that the file is either empty (just created), or a complete
 	 * XLogSegSize segment. Anything in between indicates a corrupt file.
+	 *
+	 * XXX: This means that we might not restart if a crash occurs before the
+	 * fsync below. We probably should create the file in a temporary path
+	 * like the backend does...
 	 */
 	if (fstat(f, &statbuf) != 0)
 	{
@@ -139,6 +142,16 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
 	{
 		/* File is open and ready to use */
 		walfile = f;
+
+		/*
+		 * fsync, in case of a previous crash between padding and fsyncing the
+		 * file.
+		 */
+		if (fsync_fname_ext(fn, false) != 0)
+			return false;
+		if (fsync_parent_path(fn) != 0)
+			return false;
+
 		return true;
 	}
 	if (statbuf.st_size != 0)
@@ -167,6 +180,17 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
 	}
 	free(zerobuf);
 
+	/*
+	 * fsync WAL file and containing directory, to ensure the file is
+	 * persistently created and zeroed. That's particularly important when
+	 * using synchronous mode, where the file is modified and fsynced
+	 * in-place, without a directory fsync.
+	 */
+	if (fsync_fname_ext(fn, false) != 0)
+		return false;
+	if (fsync_parent_path(fn) != 0)
+		return false;
+
 	if (lseek(f, SEEK_SET, 0) != 0)
 	{
 		fprintf(stderr,
@@ -227,10 +251,9 @@ close_walfile(char *basedir, char *partial_suffix, XLogRecPtr pos, bool mark_don
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", basedir, current_walfile_name, partial_suffix);
 		snprintf(newfn, sizeof(newfn), "%s/%s", basedir, current_walfile_name);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, current_walfile_name, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -349,14 +372,6 @@ writeTimeLineHistoryFile(char *basedir, TimeLineID tli, char *filename,
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		close(fd);
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-		return false;
-	}
-
 	if (close(fd) != 0)
 	{
 		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
@@ -367,10 +382,9 @@ writeTimeLineHistoryFile(char *basedir, TimeLineID tli, char *filename,
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
-	if (rename(tmppath, path) < 0)
+	if (durable_rename(tmppath, path) < 0)
 	{
-		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
-				progname, tmppath, path, strerror(errno));
+		/* durable_rename produced a log entry */
 		return false;
 	}
 
@@ -802,6 +816,132 @@ ReadEndOfStreamingResult(PGresult *res, XLogRecPtr *startpos, uint32 *timeline)
 }
 
 /*
+ * fsync_fname_ext -- Try to fsync a file or directory
+ *
+ * Returns 0 if the operation succeeded, -1 otherwise.
+ *
+ * XXX: This is a near-duplicate of initdb.c's fsync_fname_ext(); they should
+ * be unified into a common place.
+ */
+static int
+fsync_fname_ext(const char *fname, bool isdir)
+{
+	int			fd;
+	int			flags;
+	int			returncode;
+
+	/*
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
+	 * not writable by our userid, but we assume that's OK.
+	 */
+	flags = PG_BINARY;
+	if (!isdir)
+		flags |= O_RDWR;
+	else
+		flags |= O_RDONLY;
+
+	/*
+	 * Open the file, silently ignoring errors about unreadable files (or
+	 * unsupported operations, e.g. opening a directory under Windows), and
+	 * logging others.
+	 */
+	fd = open(fname, flags);
+	if (fd < 0)
+	{
+		if (isdir && (errno == EISDIR || errno == EACCES))
+			return 0;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return -1;
+	}
+
+	returncode = fsync(fd);
+
+	/*
+	 * Some OSes don't allow us to fsync directories at all, so we can ignore
+	 * those errors. Anything else needs to be reported.
+	 */
+	if (returncode != 0 && !(isdir && errno == EBADF))
+	{
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		close(fd);
+		return -1;
+	}
+
+	close(fd);
+	return 0;
+}
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+static int
+fsync_parent_path(const char *fname)
+{
+	char		parentpath[MAXPGPATH];
+
+	strlcpy(parentpath, fname, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	if (fsync_fname_ext(parentpath, true) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability
+ *
+ * Wrapper around rename, similar to the backend version.  Note that this
+ * version does not fsync the target file before the rename, as it's unlikely
+ * to be helpful for current and prospective users.
+ */
+static int
+durable_rename(const char *oldfile, const char *newfile)
+{
+	/*
+	 * First fsync the old path, to ensure that it is properly persistent on
+	 * disk.
+	 */
+	if (fsync_fname_ext(oldfile, false) != 0)
+		return -1;
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+	{
+		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
+				progname, oldfile, newfile, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	if (fsync_fname_ext(newfile, false) != 0)
+		return -1;
+
+	if (fsync_parent_path(newfile) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
  * The main loop of ReceiveXlogStream. Handles the COPY stream after
  * initiating streaming with the START_STREAMING command.
  *
#85Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#83)
1 attachment(s)
Re: silent data loss with ext4 / all current versions

On Mon, Mar 28, 2016 at 8:25 AM, Andres Freund <andres@anarazel.de> wrote:

I've also noticed that

Coming back to this issue because...

a) pg_basebackup doesn't do anything about durability (it probably needs
a very similar patch to the one pg_rewind just received).

I think that one of the QE tests running here just got bitten by that.
A base backup was taken with pg_basebackup, and shortly afterwards the VM
was unplugged. The catch is that pg_basebackup cannot rely on initdb:
pg_basebackup is a client-side utility, and in most of the PG packages
(Fedora, RHEL) it is shipped in the client-side package, where initdb is
not. So it seems to me that the correct fix is not to use initdb -S but
to have copies of fsync_parent_path, durable_rename and fsync_fname_ext
in streamutil.c, and then reuse them for both pg_receivexlog and
pg_basebackup. At least that's less risky for back-branches.
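
For readers who want the durability sequence behind those helpers without PostgreSQL's headers, here is a minimal sketch of "write to a temporary name, fsync, rename, fsync the directory" using plain POSIX calls. write_file_durably() is a hypothetical name used only for illustration; it is not part of the attached patch, and unlike the patch's durable_rename() it does not re-open and fsync the file under its new name, on the assumption that the contents were already flushed through the original descriptor.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not PostgreSQL code: atomically and durably replace
 * "dir/name" with "content" by writing a temporary file and renaming it.
 * Returns 0 on success, -1 on failure.
 */
static int
write_file_durably(const char *dir, const char *name, const char *content)
{
    char        path[1024];
    char        tmppath[1024];
    size_t      len = strlen(content);
    int         n;
    int         fd;

    n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t) n >= sizeof(path))
        return -1;
    n = snprintf(tmppath, sizeof(tmppath), "%s/%s.tmp", dir, name);
    if (n < 0 || (size_t) n >= sizeof(tmppath))
        return -1;

    /* Step 1: write the new contents under a temporary name. */
    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, content, len) != (ssize_t) len)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }

    /* Step 2: flush the contents before the rename can expose them. */
    if (fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    if (close(fd) != 0)
    {
        unlink(tmppath);
        return -1;
    }

    /* Step 3: atomically move the file into place. */
    if (rename(tmppath, path) != 0)
    {
        unlink(tmppath);
        return -1;
    }

    /* Step 4: flush the directory so the rename itself survives a crash. */
    fd = open(dir, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Step 4 is the part the thread is about: without it the rename is atomic but not necessarily durable.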

b) nor does pg_dump[all]

I have not hacked that up yet, but I would think that we would need
one extra copy of some of those fsync_* routines in src/bin/pg_dump/.
There is another thread for that already... On master I guess we'd end
up with something centralized in src/common/, but let's close the
existing holes first.

So we're going to have another round of fsync stuff in the next set of
releases anyway...

The sooner the better, I think. Anyone who cares about this problem is
currently limited to running initdb -S after calling pg_basebackup or
pg_dump. That works, but the flushes should be done inside each
utility.
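
As a rough picture of what flushing a whole output directory inside a utility involves, the sketch below walks a directory tree and fsyncs every regular file and directory in it. It is an assumption-laden illustration, not code from initdb or from the attached patch: sync_path() and sync_tree() are hypothetical names, symlinks and special files are skipped, and files are opened read-only, which some platforms may not accept for fsync (the patch's fsync_fname_ext() opens files O_RDWR for that reason).

```c
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open a path read-only, fsync it, and close it. */
static int
sync_path(const char *path)
{
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/*
 * Hypothetical helper: fsync every regular file and directory below "dir",
 * then "dir" itself.  Returns 0 on success, -1 as soon as anything fails.
 */
static int
sync_tree(const char *dir)
{
    DIR        *d = opendir(dir);
    struct dirent *de;
    int         rc = 0;

    if (d == NULL)
        return -1;

    while (rc == 0 && (de = readdir(d)) != NULL)
    {
        char        path[4096];
        struct stat st;
        int         n;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        n = snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (n < 0 || (size_t) n >= sizeof(path) || lstat(path, &st) != 0)
            rc = -1;
        else if (S_ISDIR(st.st_mode))
            rc = sync_tree(path);   /* recursion also syncs the subdirectory */
        else if (S_ISREG(st.st_mode))
            rc = sync_path(path);
        /* anything else (symlinks, sockets, ...) is skipped */
    }
    closedir(d);

    /* Children first, then the directory that holds their entries. */
    if (rc == 0)
        rc = sync_path(dir);
    return rc;
}
```

Syncing once at the end, rather than per file while writing, is the same trade-off the patch comments mention: one pass over the tree for performance reasons.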
--
Michael

Attachments:

0001-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patch (application/x-download; name=0001-Issue-fsync-more-carefully-in-pg_receivexlog-and-pg_.patch)
From 43dab7c9c40d3aa385c7e115c137601308b9b00f Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 12 May 2016 14:50:55 +0900
Subject: [PATCH] Issue fsync more carefully in pg_receivexlog and
 pg_basebackup [-X] stream.

Several places weren't careful about fsyncing in the way. See 1d4a0ab1
and 606e0f98 for details about required fsyncs.

This introduces a near-copy of initdb's fsync_fname_ext(), and of the
backend's durable_rename(), fsync_parent_path(). At least the frontend
duplication should be avoided; but that'd end up being hard to
backpatch.
---
 src/bin/pg_basebackup/pg_basebackup.c |  32 +++++++++
 src/bin/pg_basebackup/receivelog.c    |  55 +++++++++------
 src/bin/pg_basebackup/streamutil.c    | 126 ++++++++++++++++++++++++++++++++++
 src/bin/pg_basebackup/streamutil.h    |   4 ++
 4 files changed, 195 insertions(+), 22 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 2927b60..891622d 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1114,6 +1114,11 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 
 	if (copybuf != NULL)
 		PQfreemem(copybuf);
+
+	/*
+	 * Nothing is synced here for performance reasons, everything is done
+	 * once for all tablespaces at the end.
+	 */
 }
 
 
@@ -1390,6 +1395,18 @@ ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum)
 
 	if (basetablespace && writerecoveryconf)
 		WriteRecoveryConf();
+
+	/*
+	 * Sync data directory to ensure that data is safely on disk; this is
+	 * done once for performance reasons. Each tablespace needs to be
+	 * processed once as well.
+	 */
+	if (fsync_fname_ext(current_path, true) != 0)
+	{
+		fprintf(stderr, "%s: sync of target directory %s failed\n",
+				progname, current_path);
+		disconnect_and_exit(1);
+	}
 }
 
 /*
@@ -1931,6 +1948,21 @@ BaseBackup(void)
 	PQclear(res);
 	PQfinish(conn);
 
+	/*
+	 * Make data persistent on disk for each tablespace for tar format,
+	 * though nothing can be done when output is written to stdout. In
+	 * plain format each tablespace is synced individually.
+	 */
+	if (format == 't' && strcmp(basedir, "-") != 0)
+	{
+		if (fsync_fname_ext(basedir, true) != 0)
+		{
+			fprintf(stderr, "%s: sync of target directory %s failed\n",
+					progname, basedir);
+			exit(1);
+		}
+	}
+
 	if (verbose)
 		fprintf(stderr, "%s: base backup completed\n", progname);
 }
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 595213f..b86c9e3 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -68,17 +68,13 @@ mark_file_as_archived(const char *basedir, const char *fname)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-
-		close(fd);
+	close(fd);
 
+	if (fsync_fname_ext(tmppath, false) != 0)
 		return false;
-	}
 
-	close(fd);
+	if (fsync_parent_path(tmppath) != 0)
+		return false;
 
 	return true;
 }
@@ -116,6 +112,10 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	/*
 	 * Verify that the file is either empty (just created), or a complete
 	 * XLogSegSize segment. Anything in between indicates a corrupt file.
+	 *
+	 * XXX: This means that we might not restart if a crash occurs before the
+	 * fsync below. We probably should create the file in a temporary path
+	 * like the backend does...
 	 */
 	if (fstat(f, &statbuf) != 0)
 	{
@@ -129,6 +129,16 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	{
 		/* File is open and ready to use */
 		walfile = f;
+
+		/*
+		 * fsync, in case of a previous crash between padding and fsyncing the
+		 * file.
+		 */
+		if (fsync_fname_ext(fn, false) != 0)
+			return false;
+		if (fsync_parent_path(fn) != 0)
+			return false;
+
 		return true;
 	}
 	if (statbuf.st_size != 0)
@@ -157,6 +167,17 @@ open_walfile(StreamCtl *stream, XLogRecPtr startpoint)
 	}
 	free(zerobuf);
 
+	/*
+	 * fsync WAL file and containing directory, to ensure the file is
+	 * persistently created and zeroed. That's particularly important when
+	 * using synchronous mode, where the file is modified and fsynced
+	 * in-place, without a directory fsync.
+	 */
+	if (fsync_fname_ext(fn, false) != 0)
+		return false;
+	if (fsync_parent_path(fn) != 0)
+		return false;
+
 	if (lseek(f, SEEK_SET, 0) != 0)
 	{
 		fprintf(stderr,
@@ -217,10 +238,9 @@ close_walfile(StreamCtl *stream, XLogRecPtr pos)
 
 		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", stream->basedir, current_walfile_name, stream->partial_suffix);
 		snprintf(newfn, sizeof(newfn), "%s/%s", stream->basedir, current_walfile_name);
-		if (rename(oldfn, newfn) != 0)
+		if (durable_rename(oldfn, newfn) != 0)
 		{
-			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, current_walfile_name, strerror(errno));
+			/* durable_rename produced a log entry */
 			return false;
 		}
 	}
@@ -338,14 +358,6 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 		return false;
 	}
 
-	if (fsync(fd) != 0)
-	{
-		close(fd);
-		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, tmppath, strerror(errno));
-		return false;
-	}
-
 	if (close(fd) != 0)
 	{
 		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
@@ -356,10 +368,9 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
-	if (rename(tmppath, path) < 0)
+	if (durable_rename(tmppath, path) < 0)
 	{
-		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
-				progname, tmppath, path, strerror(errno));
+		/* durable_rename produced a log entry */
 		return false;
 	}
 
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 4d1ff90..ece5abd 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -525,3 +525,129 @@ fe_recvint64(char *buf)
 
 	return result;
 }
+
+/*
+ * fsync_fname_ext -- Try to fsync a file or directory
+ *
+ * Returns 0 if the operation succeeded, -1 otherwise.
+ *
+ * XXX: This is a near-duplicate of initdb.c's fsync_fname_ext(); they should
+ * be unified into a common place.
+ */
+int
+fsync_fname_ext(const char *fname, bool isdir)
+{
+	int			fd;
+	int			flags;
+	int			returncode;
+
+	/*
+	 * Some OSs require directories to be opened read-only whereas other
+	 * systems don't allow us to fsync files opened read-only; so we need both
+	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are
+	 * not writable by our userid, but we assume that's OK.
+	 */
+	flags = PG_BINARY;
+	if (!isdir)
+		flags |= O_RDWR;
+	else
+		flags |= O_RDONLY;
+
+	/*
+	 * Open the file, silently ignoring errors about unreadable files (or
+	 * unsupported operations, e.g. opening a directory under Windows), and
+	 * logging others.
+	 */
+	fd = open(fname, flags);
+	if (fd < 0)
+	{
+		if (isdir && (errno == EISDIR || errno == EACCES))
+			return 0;
+		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		return -1;
+	}
+
+	returncode = fsync(fd);
+
+	/*
+	 * Some OSes don't allow us to fsync directories at all, so we can ignore
+	 * those errors. Anything else needs to be reported.
+	 */
+	if (returncode != 0 && !(isdir && errno == EBADF))
+	{
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, fname, strerror(errno));
+		close(fd);
+		return -1;
+	}
+
+	close(fd);
+	return 0;
+}
+
+/*
+ * fsync_parent_path -- fsync the parent path of a file or directory
+ *
+ * This is aimed at making file operations persistent on disk in case of
+ * an OS crash or power failure.
+ */
+int
+fsync_parent_path(const char *fname)
+{
+	char		parentpath[MAXPGPATH];
+
+	strlcpy(parentpath, fname, MAXPGPATH);
+	get_parent_directory(parentpath);
+
+	/*
+	 * get_parent_directory() returns an empty string if the input argument is
+	 * just a file name (see comments in path.c), so handle that as being the
+	 * current directory.
+	 */
+	if (strlen(parentpath) == 0)
+		strlcpy(parentpath, ".", MAXPGPATH);
+
+	if (fsync_fname_ext(parentpath, true) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability
+ *
+ * Wrapper around rename, similar to the backend version.  Note that this
+ * version does not fsync the target file before the rename, as it's unlikely
+ * to be helpful for current and prospective users.
+ */
+int
+durable_rename(const char *oldfile, const char *newfile)
+{
+	/*
+	 * First fsync the old path, to ensure that it is properly persistent on
+	 * disk.
+	 */
+	if (fsync_fname_ext(oldfile, false) != 0)
+		return -1;
+
+	/* Time to do the real deal... */
+	if (rename(oldfile, newfile) != 0)
+	{
+		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
+				progname, oldfile, newfile, strerror(errno));
+		return -1;
+	}
+
+	/*
+	 * To guarantee renaming the file is persistent, fsync the file with its
+	 * new name, and its containing directory.
+	 */
+	if (fsync_fname_ext(newfile, false) != 0)
+		return -1;
+
+	if (fsync_parent_path(newfile) != 0)
+		return -1;
+
+	return 0;
+}
diff --git a/src/bin/pg_basebackup/streamutil.h b/src/bin/pg_basebackup/streamutil.h
index d2d5a6d..71ae9ca 100644
--- a/src/bin/pg_basebackup/streamutil.h
+++ b/src/bin/pg_basebackup/streamutil.h
@@ -48,4 +48,8 @@ extern bool feTimestampDifferenceExceeds(int64 start_time, int64 stop_time,
 extern void fe_sendint64(int64 i, char *buf);
 extern int64 fe_recvint64(char *buf);
 
+extern int fsync_parent_path(const char *fname);
+extern int fsync_fname_ext(const char *fname, bool isdir);
+extern int durable_rename(const char *oldfile, const char *newfile);
+
 #endif   /* STREAMUTIL_H */
-- 
2.8.2

#86Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#85)
Re: silent data loss with ext4 / all current versions

On Thu, May 12, 2016 at 2:58 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

The sooner the better I think. For now, anyone who cares about this
problem is limited to running initdb -S after calling pg_basebackup or
pg_dump. That works as a workaround, though the flushes should be done
inside each utility.

And actually this won't get far if there is no equivalent of
walkdir(), that is, if the fsync() calls are not applied recursively.
On master at least, the refactoring had better be done cleanly first.
Andres, do you think that this should be part of fe_utils or
src/common/? I'd tend to think the latter is a better fit, as there is
an equivalent in the backend. On back-branches, we could just have
something like fsync_recursively that walks through the paths, and keep
it in src/bin/pg_basebackup. An even simpler approach would be to
fsync() each file individually as it is written, but that would hurt
performance.
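
To make the back-branch idea a bit more concrete, here is a rough,
untested-in-tree sketch of what such a helper could look like. It is
not part of the attached patch. The names fsync_recursively and
sync_one_path are only placeholders, and it assumes a POSIX system with
opendir()/readdir()/lstat(). Error handling mirrors fsync_fname_ext
from the patch (EBADF on a directory fsync is tolerated).

```c
#define _POSIX_C_SOURCE 200809L	/* for lstat() and friends */

#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* fsync a single file or directory; returns 0 on success, -1 on failure. */
static int
sync_one_path(const char *path, int isdir)
{
	int			fd;

	/* Directories are opened read-only, plain files read-write. */
	fd = open(path, isdir ? O_RDONLY : O_RDWR);
	if (fd < 0)
	{
		fprintf(stderr, "could not open \"%s\": %s\n", path, strerror(errno));
		return -1;
	}

	/* Some systems cannot fsync directories; tolerate only that case. */
	if (fsync(fd) != 0 && !(isdir && errno == EBADF))
	{
		fprintf(stderr, "could not fsync \"%s\": %s\n", path, strerror(errno));
		close(fd);
		return -1;
	}

	close(fd);
	return 0;
}

/*
 * Hypothetical fsync_recursively(): sync every regular file and directory
 * below "root" (children first), then "root" itself.  Symbolic links and
 * special files are skipped.  Returns 0 on success, -1 on the first failure.
 */
int
fsync_recursively(const char *root)
{
	DIR		   *dir;
	struct dirent *de;
	int			result = 0;

	assert(root != NULL);

	dir = opendir(root);
	if (dir == NULL)
	{
		fprintf(stderr, "could not open directory \"%s\": %s\n",
				root, strerror(errno));
		return -1;
	}

	while (result == 0 && (de = readdir(dir)) != NULL)
	{
		char		child[PATH_MAX];
		struct stat st;
		int			len;

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;

		len = snprintf(child, sizeof(child), "%s/%s", root, de->d_name);
		if (len < 0 || (size_t) len >= sizeof(child) || lstat(child, &st) != 0)
		{
			fprintf(stderr, "could not stat entry \"%s\" in \"%s\"\n",
					de->d_name, root);
			result = -1;
		}
		else if (S_ISDIR(st.st_mode))
			result = fsync_recursively(child);
		else if (S_ISREG(st.st_mode))
			result = sync_one_path(child, 0);
	}
	closedir(dir);

	if (result != 0)
		return -1;

	/* Finally make the directory entries themselves persistent. */
	return sync_one_path(root, 1);
}
```

pg_basebackup would then call something like fsync_recursively(basedir)
once at the end instead of syncing only the top-level directory. Note
that this sketch does not follow symlinks, so tablespace links would
need separate handling, and it does not sync the parent directory of
the root path.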

Thoughts from others?
--
Michael
