fd.c: flush data problems on osx
Hi
Current implementation of pg_flush_data when called with zero offset and zero nbytes is assumed to flush all file. In osx it uses mmap with these arguments, but according to man:
"[EINVAL] The len argument was negative or zero. Historically, the system call would not return an
error if the argument was zero. See other potential additional restrictions in the COMPAT-
IBILITY section below."
so it is generate a lot of warnings:
"WARNING: could not mmap while flushing dirty data: Invalid argument"
Call to pg_flush_data with zero offset and nbytes happens when replica starts from base backup and fsync=on. TAP test to reproduce is attached.
One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0.
Also there are no default ifdef inside this function, is there any check that will guarantee that pg_flush_data will not end up with empty body on some platform?
---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
flushdata.patchapplication/octet-stream; name=flushdata.patchDownload
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 3e02dce..b6eebb8 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -406,6 +406,16 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
if (!enableFsync)
return;
+ /* Flush whole file. */
+ if ((offset == 0) && (nbytes == 0))
+ {
+ if (pg_fdatasync(fd) != 0)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not flush dirty data: %m")));
+ return;
+ }
+
/*
* XXX: compile all alternatives, to find portability problems more easily
*/
Hi,
On 2016-03-17 20:09:53 +0300, Stas Kelvich wrote:
Current implementation of pg_flush_data when called with zero offset and zero nbytes is assumed to flush all file. In osx it uses mmap with these arguments, but according to man:
"[EINVAL] The len argument was negative or zero. Historically, the system call would not return an
error if the argument was zero. See other potential additional restrictions in the COMPAT-
IBILITY section below."so it is generate a lot of warnings:
"WARNING: could not mmap while flushing dirty data: Invalid argument"
Hm, yea, that's buggy.
One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0.
Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get
the file size. Could you test that?
Also there are no default ifdef inside this function, is there any
check that will guarantee that pg_flush_data will not end up with
empty body on some platform?
There doesn't need to be - it's purely "advisory", i.e. just an
optimization.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 17 Mar 2016, at 20:23, Andres Freund <andres@anarazel.de> wrote:
Also there are no default ifdef inside this function, is there any
check that will guarantee that pg_flush_data will not end up with
empty body on some platform?There doesn't need to be - it's purely "advisory", i.e. just an
optimization.
Ah, okay, then I misinterpret purpose of that function and it shouldn’t be forced to sync.
One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0.
Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get
the file size. Could you test that?
It looks like OSX mmap raises EINVAL when length isn’t aligned to pagesize while manual says it can be of arbitrary length, so i aligned it.
Also there were call to mmap with PROT_READ | PROT_WRITE, but when called from pre_sync_fname file descriptor is just O_RDONLY, so i changed mmap mode to PROT_READ — seems that PROT_WRITE wasn’t needed anyway.
And all of that reduces number of warnings in order of magnitude but there are still some and I don’t yet understand why are they happening.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
flushdata.v2.patchapplication/octet-stream; name=flushdata.v2.patchDownload
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 3e02dce..e0e8015 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -451,8 +451,20 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
* We map the file (mmap()), tell the kernel to sync back the contents
* (msync()), and then remove the mapping again (munmap()).
*/
+
+ /* mmap() need exact length when we want to map whole file */
+ if ((offset == 0) && (nbytes == 0))
+ {
+ int pagesize = getpagesize();
+ nbytes = (lseek(fd, offset, SEEK_END)/pagesize + 1)*pagesize;
+ if (nbytes < 0)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not determine dirty data size: %m")));
+ }
+
p = mmap(NULL, nbytes,
- PROT_READ | PROT_WRITE, MAP_SHARED,
+ PROT_READ, MAP_SHARED,
fd, offset);
if (p == MAP_FAILED)
{
On 18 Mar 2016, at 14:45, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0.
Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get
the file size. Could you test that?It looks like OSX mmap raises EINVAL when length isn’t aligned to pagesize while manual says it can be of arbitrary length, so i aligned it.
Also there were call to mmap with PROT_READ | PROT_WRITE, but when called from pre_sync_fname file descriptor is just O_RDONLY, so i changed mmap mode to PROT_READ — seems that PROT_WRITE wasn’t needed anyway.And all of that reduces number of warnings in order of magnitude but there are still some and I don’t yet understand why are they happening.
I’ve spend some more time on this issue and found that remaining warnings were caused by mmap-ing directories — that raises EINVAL in OSX (probably not only OSX, but I didn’t tried).
So i’ve skipped mmap for dirs and now restore happens without warnings. Also I’ve fixed wrong error check that was in previous version of patch.
Attachments:
flushdata.v3.patchapplication/octet-stream; name=flushdata.v3.patchDownload
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index a51ee81..d80d18a 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -192,7 +192,7 @@ copy_file(char *fromfile, char *tofile)
* cache and hopefully get the kernel to start writing them out before
* the fsync comes.
*/
- pg_flush_data(dstfd, offset, nbytes);
+ pg_flush_data(dstfd, offset, nbytes, false);
}
if (CloseTransientFile(dstfd))
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 3e02dce..f5da08c 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -395,8 +395,10 @@ pg_fdatasync(int fd)
* flushed.
*/
void
-pg_flush_data(int fd, off_t offset, off_t nbytes)
+pg_flush_data(int fd, off_t offset, off_t nbytes, bool isdir)
{
+ (void) isdir; /* this can be unused on some archs */
+
/*
* Right now file flushing is primarily used to avoid making later
* fsync()/fdatasync() calls have a less impact. Thus don't trigger
@@ -451,8 +453,31 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
* We map the file (mmap()), tell the kernel to sync back the contents
* (msync()), and then remove the mapping again (munmap()).
*/
+
+ /* mmap() will not work with dirs */
+ if (isdir)
+ return;
+
+ /* mmap() need exact length when we want to map whole file */
+ if ((offset == 0) && (nbytes == 0))
+ {
+ int pagesize = sysconf(_SC_PAGESIZE);
+
+ nbytes = lseek(fd, 0, SEEK_END);
+ if (nbytes < 0)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not determine dirty data size: %m")));
+
+ /* aling to pagesize with underestimation */
+ nbytes = (nbytes/pagesize)*pagesize;
+
+ if(nbytes == 0)
+ return;
+ }
+
p = mmap(NULL, nbytes,
- PROT_READ | PROT_WRITE, MAP_SHARED,
+ PROT_READ, MAP_SHARED,
fd, offset);
if (p == MAP_FAILED)
{
@@ -1514,7 +1539,7 @@ FileWriteback(File file, off_t offset, int amount)
if (returnCode < 0)
return;
- pg_flush_data(VfdCache[file].fd, offset, amount);
+ pg_flush_data(VfdCache[file].fd, offset, amount, false);
}
int
@@ -2920,7 +2945,7 @@ pre_sync_fname(const char *fname, bool isdir, int elevel)
* pg_flush_data() ignores errors, which is ok because this is only a
* hint.
*/
- pg_flush_data(fd, 0, 0);
+ pg_flush_data(fd, 0, 0, isdir);
(void) CloseTransientFile(fd);
}
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index be24369..60246f7 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -116,7 +116,7 @@ extern int pg_fsync(int fd);
extern int pg_fsync_no_writethrough(int fd);
extern int pg_fsync_writethrough(int fd);
extern int pg_fdatasync(int fd);
-extern void pg_flush_data(int fd, off_t offset, off_t amount);
+extern void pg_flush_data(int fd, off_t offset, off_t amount, bool isdir);
extern void fsync_fname(const char *fname, bool isdir);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_link_or_rename(const char *oldfile, const char *newfile, int loglevel);
On 2016-03-21 14:46:09 +0300, Stas Kelvich wrote:
On 18 Mar 2016, at 14:45, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
One possible solution for that is just fallback to pg_fdatasync in case when offset = nbytes = 0.
Hm, that's a bit heavyweight. I'd rather do an lseek(SEEK_END) to get
the file size. Could you test that?It looks like OSX mmap raises EINVAL when length isn’t aligned to pagesize while manual says it can be of arbitrary length, so i aligned it.
Also there were call to mmap with PROT_READ | PROT_WRITE, but when called from pre_sync_fname file descriptor is just O_RDONLY, so i changed mmap mode to PROT_READ — seems that PROT_WRITE wasn’t needed anyway.And all of that reduces number of warnings in order of magnitude but there are still some and I don’t yet understand why are they happening.
I’ve spend some more time on this issue and found that remaining
warnings were caused by mmap-ing directories — that raises EINVAL in
OSX (probably not only OSX, but I didn’t tried).
So i’ve skipped mmap for dirs and now restore happens without
warnings. Also I’ve fixed wrong error check that was in previous
version of patch.
Hm. I think we should rather just skip calling pg_flush_data in the
directory case, that very likely isn't beneficial on any OS.
I think we still need to fix the mmap() implementation to support the
offset = 0, nbytes = 0 case (via fseek(SEEK_END).
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 21 Mar 2016, at 14:53, Andres Freund <andres@anarazel.de> wrote:
Hm. I think we should rather just skip calling pg_flush_data in the
directory case, that very likely isn't beneficial on any OS.
Seems reasonable, changed.
I think we still need to fix the mmap() implementation to support the
offset = 0, nbytes = 0 case (via fseek(SEEK_END).
It is already in this diff. I’ve added this few messages ago.
Attachments:
flushdata.v4.patchapplication/octet-stream; name=flushdata.v4.patchDownload
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 3e02dce..2a923d3 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -451,8 +451,27 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
* We map the file (mmap()), tell the kernel to sync back the contents
* (msync()), and then remove the mapping again (munmap()).
*/
+
+ /* mmap() need exact length when we want to map whole file */
+ if ((offset == 0) && (nbytes == 0))
+ {
+ int pagesize = sysconf(_SC_PAGESIZE);
+
+ nbytes = lseek(fd, 0, SEEK_END);
+ if (nbytes < 0)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not determine dirty data size: %m")));
+
+ /* aling to pagesize with underestimation */
+ nbytes = (nbytes/pagesize)*pagesize;
+
+ if (nbytes == 0)
+ return;
+ }
+
p = mmap(NULL, nbytes,
- PROT_READ | PROT_WRITE, MAP_SHARED,
+ PROT_READ, MAP_SHARED,
fd, offset);
if (p == MAP_FAILED)
{
@@ -2920,7 +2939,8 @@ pre_sync_fname(const char *fname, bool isdir, int elevel)
* pg_flush_data() ignores errors, which is ok because this is only a
* hint.
*/
- pg_flush_data(fd, 0, 0);
+ if (!isdir)
+ pg_flush_data(fd, 0, 0);
(void) CloseTransientFile(fd);
}
On Mon, Mar 21, 2016 at 03:09:56PM +0300, Stas Kelvich wrote:
On 21 Mar 2016, at 14:53, Andres Freund <andres@anarazel.de> wrote:
Hm. I think we should rather just skip calling pg_flush_data in the
directory case, that very likely isn't beneficial on any OS.Seems reasonable, changed.
I think we still need to fix the mmap() implementation to support the
offset = 0, nbytes = 0 case (via fseek(SEEK_END).It is already in this diff. I’ve added this few messages ago.
[This is a generic notification.]
The above-described topic is currently a PostgreSQL 9.6 open item. Andres,
since you committed the patch believed to have created it, you own this open
item. If that responsibility lies elsewhere, please let us know whose
responsibility it is to fix this. Since new open items may be discovered at
any time and I want to plan to have them all fixed well in advance of the ship
date, I will appreciate your efforts toward speedy resolution. Please
present, within 72 hours, a plan to fix the defect within seven days of this
message. Thanks.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Mar 21, 2016 at 9:09 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 21 Mar 2016, at 14:53, Andres Freund <andres@anarazel.de> wrote:
Hm. I think we should rather just skip calling pg_flush_data in the
directory case, that very likely isn't beneficial on any OS.Seems reasonable, changed.
I think we still need to fix the mmap() implementation to support the
offset = 0, nbytes = 0 case (via fseek(SEEK_END).It is already in this diff. I’ve added this few messages ago.
A similar change seems to be needed in initdb.c's pre_sync_fname.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Michael Paquier <michael.paquier@gmail.com> writes:
On Mon, Mar 21, 2016 at 9:09 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
On 21 Mar 2016, at 14:53, Andres Freund <andres@anarazel.de> wrote:
I think we still need to fix the mmap() implementation to support the
offset = 0, nbytes = 0 case (via fseek(SEEK_END).
It is already in this diff. I’ve added this few messages ago.
There are bigger issues in this code, actually, to wit assuming that
it should always be possible to mmap the target region. That's patently
false on 32-bit machines, where you likely won't have a gigabyte of free
address space.
For this reason, I think that issuing a WARNING on mmap failure is a
damfool idea, and giving up on the flush attempt even more so. We
should just silently fall through to the next implementation if we
cannot mmap the target region.
While I'm whinging: what happens when off_t is wider than size_t? It's
entirely possible that the user has configured the relation segment size
to more than 4GB, so I do not think that's academic. I think we also need
a test for length > SIZE_T_MAX and then fall through to another
implementation, rather than mapping and msync'ing some initial fragment of
the file, which is what will happen now.
A similar change seems to be needed in initdb.c's pre_sync_fname.
Hmm, do we need to move this logic into src/common?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
I wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
A similar change seems to be needed in initdb.c's pre_sync_fname.
Hmm, do we need to move this logic into src/common?
I concluded that sharing the code would be more trouble than it's worth,
because initdb.c's version of this is in fact not broken: it was never
taught about mmap, and it doesn't need to be IMO, because it's not that
performance-critical.
I fixed the other things I complained about and pushed this. Please
check to see that the committed patch resolves your original problem.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2016-04-13 17:21:01 -0400, Tom Lane wrote:
I concluded that sharing the code would be more trouble than it's worth,
because initdb.c's version of this is in fact not broken: it was never
taught about mmap, and it doesn't need to be IMO, because it's not that
performance-critical.
Agreed, there seems little immediate reason to go through additional
trouble. The issue of posix_fadvise(DONTNEED) dropping the cache seems
much less severe for initdb, than for the flushing patch (where it'd
obviously hurt).
It's probably good to unify the paths at some point in the future
though. We're lacking fsyncs in a bunch of relatively important paths
(WIP patch for most of them posted), and we end up needing fsync (and
probably the flushing logic) in the backend, initdb,
pg_basebackup/pg_recvlogical, pg_rewind and possibly some others that I
forgot about.
I fixed the other things I complained about and pushed this.
Thanks for taking care of it!
Please check to see that the committed patch resolves your original problem.
I looked through the commit, and I only have one very minor nitpick:
It might be good if
+ * We compile all alternatives that are supported on the current platform,
+ * to find portability problems more easily.
noted that we also fall through the approaches.
I'm not entirely sure what
+ /*
+ * Caution: do not call pg_flush_data with amount = 0, it could trash the
+ * file's seek position.
+ */
+ if (amount <= 0)
+ return;
+
is about?
- Andres
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
I'm not entirely sure what
+ /* + * Caution: do not call pg_flush_data with amount = 0, it could trash the + * file's seek position. + */ + if (amount <= 0) + return; +
is about?
fd.c tracks seek position for open files. I'm not sure that that
function can get called with amount == 0, but if it did, the caller
would certainly not be expecting the file position to change.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2016-04-13 17:44:41 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I'm not entirely sure what
+ /* + * Caution: do not call pg_flush_data with amount = 0, it could trash the + * file's seek position. + */ + if (amount <= 0) + return; +is about?
fd.c tracks seek position for open files. I'm not sure that that
function can get called with amount == 0, but if it did, the caller
would certainly not be expecting the file position to change.
Ok, fair enough. (And no, it should currently be never called that way)
Andres
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
On 2016-04-13 17:44:41 -0400, Tom Lane wrote:
fd.c tracks seek position for open files. I'm not sure that that
function can get called with amount == 0, but if it did, the caller
would certainly not be expecting the file position to change.
Ok, fair enough. (And no, it should currently be never called that way)
BTW, I just noticed another issue here, which is that FileWriteback
and the corresponding smgr routines are declared with bogusly narrow
"amount" arguments --- eg, it's silly that FileWriteback only takes
an int amount. I think this code could be actively broken for
relation segment sizes exceeding 2GB, and even if it isn't, we should
define the functions correctly the first time.
Will fix the function definitions, but I'm kind of wondering exactly how
many times the inner loop in IssuePendingWritebacks() could possibly
iterate ...
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2016-04-13 18:09:18 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2016-04-13 17:44:41 -0400, Tom Lane wrote:
fd.c tracks seek position for open files. I'm not sure that that
function can get called with amount == 0, but if it did, the caller
would certainly not be expecting the file position to change.Ok, fair enough. (And no, it should currently be never called that way)
BTW, I just noticed another issue here, which is that FileWriteback
and the corresponding smgr routines are declared with bogusly narrow
"amount" arguments --- eg, it's silly that FileWriteback only takes
an int amount.
Well, I modeled it after the nearby routines (like FileRead), which all
only take an amount in int. Now there might be less reason to read a lot
of data at once, than to flush large amounts; but it still didn't seem
necessary to break with the rest of the functions in the file.
I think this code could be actively broken for relation segment sizes
exceeding 2GB, and even if it isn't, we should define the functions
correctly the first time.
I don't think it's actively problematic, ->max_pending (and thus
nr_pending) is limited by #define WRITEBACK_MAX_PENDING_FLUSHES 256
(although I wondered whether we should increase the limit a bit
further). Even with a blocksize of 32768, that's pretty far away from
exceeding INT_MAX.
Will fix the function definitions, but I'm kind of wondering exactly how
many times the inner loop in IssuePendingWritebacks() could possibly
iterate ...
WRITEBACK_MAX_PENDING_FLUSHES/256, due to the limitation mentioned
above.
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
On 2016-04-13 18:09:18 -0400, Tom Lane wrote:
BTW, I just noticed another issue here, which is that FileWriteback
and the corresponding smgr routines are declared with bogusly narrow
"amount" arguments --- eg, it's silly that FileWriteback only takes
an int amount.
Well, I modeled it after the nearby routines (like FileRead), which all
only take an amount in int. Now there might be less reason to read a lot
of data at once, than to flush large amounts; but it still didn't seem
necessary to break with the rest of the functions in the file.
Well, those APIs are pretty historical. I think it's useful to get
new ones correct from the outset.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers