Missing data_sync_elevel() for some calls of pg_fsync()?

Started by Michael Paquierabout 6 years ago3 messages
#1Michael Paquier
michael@paquier.xyz
1 attachment(s)

Hi all,

I was just looking at some callers of pg_fsync(), to notice that some
code paths don't use data_sync_elevel(). For some code paths, that's
actually better to never PANIC (say backup_label file, logical
decoding snapshot, lock file where FATAL/LOG are used now, etc.).
However I have spotted three code paths where this is not done and I
think that's not fine:
- 2PC file generated at checkpoint time.
- WAL segment initialization.
- Temporary state file for a replication slot save, which may cause
ERRORs at checkpoint time.

Any thoughts?
--
Michael

Attachments:

pgfsync-panic-v1.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 529976885f..0cae748348 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1661,7 +1661,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	 */
 	pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", path)));
 	pgstat_report_wait_end();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5f0ee50092..986d95243c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3315,7 +3315,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 
 		close(fd);
 		errno = save_errno;
-		ereport(ERROR,
+		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	}
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 21ae8531b3..68f7ba1247 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1307,7 +1307,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
 		pgstat_report_wait_end();
 		CloseTransientFile(fd);
 		errno = save_errno;
-		ereport(elevel,
+		ereport(data_sync_elevel(elevel),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m",
 						tmppath)));
#2Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#1)
Re: Missing data_sync_elevel() for some calls of pg_fsync()?

On Mon, Dec 2, 2019 at 5:58 PM Michael Paquier <michael@paquier.xyz> wrote:

I was just looking at some callers of pg_fsync(), to notice that some
code paths don't use data_sync_elevel(). For some code paths, that's
actually better to never PANIC (say backup_label file, logical
decoding snapshot, lock file where FATAL/LOG are used now, etc.).
However I have spotted three code paths where this is not done and I
think that's not fine:
- 2PC file generated at checkpoint time.
- WAL segment initialization.
- Temporary state file for a replication slot save, which may cause
ERRORs at checkpoint time.

One of the distinctions I had in mind when reviewing/working on the
PANIC stuff was this:

1. Some code opens a file, writes some stuff to it, closes, and then
fsyncs it, and if that fails and and it ever retries it'll redo all of
those steps again. We know that some kernels might have thrown away
the data, but we don't care about the copy in the kernel's cache
because we'll write it out again next time around.

2. Some code, primarily buffer pool write-back code, writes data out
to the file, then throws away the only copy we have of it other than
the WAL by using the buffer for some other block, and then later
(usually in the checkpointer) fsyncs it. One way to look at it is
that if the fsync fails, the only place left to get that data (which
may represent committed transactions) is the WAL, and the only way to
get it is crash recovery. Another way to look at it is that if we
didn't PANIC, the checkpointer would try again, but it's easily
demonstrable that if it tries again, certain kernels will do nothing
and then tell you that it succeeded, so we need to prevent that or our
checkpoint would be a data-eating lie.

I didn't look closely at your patch, but I think those are category 1,
no? Admittedly there is an argument that we should panic in those
cases too, because otherwise we're second guessing how the kernel
works, and I did make a similar argument for why sync_file_range()
failure is panic-worthy.

#3Craig Ringer
craig@2ndquadrant.com
In reply to: Thomas Munro (#2)
Re: Missing data_sync_elevel() for some calls of pg_fsync()?

On Mon, 2 Dec 2019 at 13:43, Thomas Munro <thomas.munro@gmail.com> wrote:

1. Some code opens a file, writes some stuff to it, closes, and then
fsyncs it, and if that fails and and it ever retries it'll redo all of
those steps again. We know that some kernels might have thrown away
the data, but we don't care about the copy in the kernel's cache
because we'll write it out again next time around.

Can we trust the kernel to be reporting the EIO or ENOSPC only from
writeback buffers for the actual file we're fsync()ing though? Not from
buffers it flushed while performing our fsync() request, failed to flush,
and complained about?

I'm not confident I want to assume that.
--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise