Tighten error control for OpenTransientFile/CloseTransientFile

Started by Michael Paquierabout 7 years ago12 messageshackers
Jump to latest
#1Michael Paquier
michael@paquier.xyz

Hi all,

Joe's message here has reminded me that we have lacked a lot of error
handling around CloseTransientFile():
/messages/by-id/c49b69ec-e2f7-ff33-4f17-0eaa4f2cef27@joeconway.com

This has been mentioned by Alvaro a couple of months ago (cannot find
the thread about that at quick glance), and I just forgot about it at
that time. Anyway, attached is a patch to do some cleanup for all
that:
- Switch OpenTransientFile to read-only where sufficient.
- Add more error handling for CloseTransientFile
A major take of this patch is to make sure that the new error messages
generated have an elevel consistent with their neighbors.

Just on time for this last CF. Thoughts?
--
Michael

Attachments:

transient-fd-error-v1.patchtext/x-diff; charset=us-asciiDownload+119-28
#2Joe Conway
mail@joeconway.com
In reply to: Michael Paquier (#1)
Re: Tighten error control for OpenTransientFile/CloseTransientFile

On 2/28/19 9:33 PM, Michael Paquier wrote:

Hi all,

Joe's message here has reminded me that we have lacked a lot of error
handling around CloseTransientFile():
/messages/by-id/c49b69ec-e2f7-ff33-4f17-0eaa4f2cef27@joeconway.com

This has been mentioned by Alvaro a couple of months ago (cannot find
the thread about that at quick glance), and I just forgot about it at
that time. Anyway, attached is a patch to do some cleanup for all
that:
- Switch OpenTransientFile to read-only where sufficient.
- Add more error handling for CloseTransientFile
A major take of this patch is to make sure that the new error messages
generated have an elevel consistent with their neighbors.

Just on time for this last CF. Thoughts?

Seems like it would be better to modify the arguments to
CloseTransientFile() to include the filename being closed, errorlevel,
and fail_on_error or something similar. Then all the repeated ereport
stanzas could be eliminated.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

#3Michael Paquier
michael@paquier.xyz
In reply to: Joe Conway (#2)
Re: Tighten error control for OpenTransientFile/CloseTransientFile

On Fri, Mar 01, 2019 at 05:05:54PM -0500, Joe Conway wrote:

Seems like it would be better to modify the arguments to
CloseTransientFile() to include the filename being closed, errorlevel,
and fail_on_error or something similar. Then all the repeated ereport
stanzas could be eliminated.

Sure. Now some code paths close file descriptors without having at
hand the file name, which would mean that we'd need to pass NULL as
argument in this case. That's not really elegant in my opinion. And
having a consistent mapping with the system's close() is not really
bad to me either..
--
Michael

#4Georgios Kokolatos
gkokolatos@protonmail.com
In reply to: Michael Paquier (#3)
Re: Tighten error control for OpenTransientFile/CloseTransientFile

Overall the patch looks good and according to the previous discussion fulfils its purpose.

It might be worthwhile to also check for errors on close in SaveSlotToPath().

pgstat_report_wait_end();

CloseTransientFile(fd);

/* rename to permanent file, fsync file and directory */
if (rename(tmppath, path) != 0)

#5Robert Haas
robertmhaas@gmail.com
In reply to: Joe Conway (#2)
Re: Tighten error control for OpenTransientFile/CloseTransientFile

On Fri, Mar 1, 2019 at 5:06 PM Joe Conway <mail@joeconway.com> wrote:

Seems like it would be better to modify the arguments to
CloseTransientFile() to include the filename being closed, errorlevel,
and fail_on_error or something similar. Then all the repeated ereport
stanzas could be eliminated.

Hmm. I'm not sure that really saves much in terms of notation, and
it's less flexible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Michael Paquier
michael@paquier.xyz
In reply to: Georgios Kokolatos (#4)
Re: Tighten error control for OpenTransientFile/CloseTransientFile

On Wed, Mar 06, 2019 at 02:54:52PM +0000, Georgios Kokolatos wrote:

Overall the patch looks good and according to the previous
discussion fulfils its purpose.

It might be worthwhile to also check for errors on close in
SaveSlotToPath().

Thanks for the feedback, added. I have spent some time
double-checking this stuff, and noticed that the new errors in
StartupReplicationOrigin() and CheckPointReplicationOrigin() should be
switched from ERROR to PANIC to be consistent. One message in
dsm_impl_mmap() was not consistent either.

Are there any objections if I commit this patch?
--
Michael

Attachments:

transient-fd-error-v2.patchtext/x-diff; charset=us-asciiDownload+129-30
#7Georgios Kokolatos
gkokolatos@protonmail.com
In reply to: Michael Paquier (#6)
Re: Tighten error control for OpenTransientFile/CloseTransientFile

The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

The second version of this patch seems to be in order and ready for committer.

Thank you for taking the time to code!

The new status of this patch is: Ready for Committer

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#6)
Re: Tighten error control for OpenTransientFile/CloseTransientFile

On 2019-Mar-07, Michael Paquier wrote:

#else
-	close(fd);
+	if (close(fd))
+	{
+		fprintf(stderr, _("%s: could not close file \"%s\": %s"),
+				progname, ControlFilePath, strerror(errno));
+		exit(EXIT_FAILURE);
+	}
#endif

I think this one needs a terminating \n.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#8)
Re: Tighten error control for OpenTransientFile/CloseTransientFile

On Thu, Mar 07, 2019 at 10:00:05PM -0300, Alvaro Herrera wrote:

I think this one needs a terminating \n.

Argh... Thanks for the lookup, Alvaro.
--
Michael

#10Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#9)
Re: Tighten error control for OpenTransientFile/CloseTransientFile

On Fri, Mar 08, 2019 at 10:23:24AM +0900, Michael Paquier wrote:

Argh... Thanks for the lookup, Alvaro.

And committed, after an extra pass to beautify the whole experience.
--
Michael

#11Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#6)
Re: Tighten error control for OpenTransientFile/CloseTransientFile

Hi,

On 2019-03-07 10:56:25 +0900, Michael Paquier wrote:

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f5cf9ffc9c..bce4274362 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -1202,7 +1202,10 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
errmsg("could not fsync file \"%s\": %m", path)));
pgstat_report_wait_end();
-	CloseTransientFile(fd);
+	if (CloseTransientFile(fd))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
}

...

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 64679dd2de..21986e48fe 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1297,7 +1297,11 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
}
pgstat_report_wait_end();
-	CloseTransientFile(fd);
+
+	if (CloseTransientFile(fd))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
hdr = (TwoPhaseFileHeader *) buf;
if (hdr->magic != TWOPHASE_MAGIC)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0fdd82a287..c7047738b6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3469,7 +3469,10 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
(errcode_for_file_access(),
errmsg("could not close file \"%s\": %m", tmppath)));
-	CloseTransientFile(srcfd);
+	if (CloseTransientFile(srcfd))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));

/*
* Now move the segment into place with its final name.

...

This seems like an odd set of changes to me. What is this supposed to
buy us? The commit message says:
2) When opening transient files, it is up to the caller to close the
file descriptors opened. In error code paths, CloseTransientFile() gets
called to clean up things before issuing an error. However in normal
exit paths, a lot of callers of CloseTransientFile() never actually
reported errors, which could leave a file descriptor open without
knowing about it. This is an issue I complained about a couple of
times, but never had the courage to write and submit a patch, so here we
go.

but that reasoning seems bogus to me. For one, on just about any
platform close always closes the fd, even when returning an error
(unless you pass in a bad fd, in which case it obviously doesn't). So
the reasoning that this fixes unnoticed fd leaks doesn't really seem to
make sense. For another, even if it did, throwing an ERROR seems to
achieve very little: We continue with a leaked fd *AND* we cause the
operation to error out.

I can see reasoning for:
- LOG, so it can be noticed, but operations continue to work
- FATAL, to fix the leak
- PANIC, so we recover from the problem, in case of the close indicating
a durability issue

commit 9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1
Author: Thomas Munro <tmunro@postgresql.org>
Date: 2018-11-19 13:31:10 +1300

PANIC on fsync() failure.

but ERROR seems to have very little going for it.

The durability argument doesn't seem to apply for the cases where we
previously fsynced the file, a significant fraction of the locations you
touched.

And if your goal was just to achieve consistency, I also don't
understand, because you left plenty close()'s unchecked? E.g. you added
an error check in get_controlfile(), but not one in
ReadControlFile(). alterSystemSetConfigFile() writes, but you didn't add
one.

Greetings,

Andres Freund

#12Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#11)
Re: Tighten error control for OpenTransientFile/CloseTransientFile

On Fri, Oct 04, 2019 at 01:39:38PM -0700, Andres Freund wrote:

but that reasoning seems bogus to me. For one, on just about any
platform close always closes the fd, even when returning an error
(unless you pass in a bad fd, in which case it obviously doesn't). So
the reasoning that this fixes unnoticed fd leaks doesn't really seem to
make sense. For another, even if it did, throwing an ERROR seems to
achieve very little: We continue with a leaked fd *AND* we cause the
operation to error out.

I have read again a couple of times the commit log, and this mentions
to let users know that a fd is leaking, not that it fixes things.
Still we get to know about it, while previously it was not possible.
In some cases we may see errors in close() after a previous write(2).
Of course this does not apply to all the code paths patched here, but
it seems to me that's a good habit to spread, no?

I can see reasoning for:
- LOG, so it can be noticed, but operations continue to work
- FATAL, to fix the leak
- PANIC, so we recover from the problem, in case of the close indicating
a durability issue

LOG or WARNING would not be visible enough and would likely be skipped
by users. Not sure that this justifies a FATAL either, and PANIC
would cause more harm than necessary, so for most of them ERROR
sounded like a good compromise, still the elevel choice is not
innocent depending on the code paths patched, because the elevel used
is consistent with the error handling of the surroundings.

And if your goal was just to achieve consistency, I also don't
understand, because you left plenty close()'s unchecked? E.g. you added
an error check in get_controlfile(), but not one in
ReadControlFile(). alterSystemSetConfigFile() writes, but you didn't add
one.

Because I have not considered these when looking at transient files.
That may be worth an extra lookup.
--
Michael