pread() and pwrite()
Hello hackers,
A couple of years ago, Oskari Saarenmaa proposed a patch[1]/messages/by-id/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7@ohmu.fi to adopt
$SUBJECT. Last year, Tobias Oberstein argued again that we should do
that[2]/messages/by-id/b8748d39-0b19-0514-a1b9-4e5a28e6a208@gmail.com. At the end of that thread there was a +1 from multiple
committers in support of getting it done for PostgreSQL 12. Since
Oskari hasn't returned, I decided to dust off his patch and start a
new thread.
Here is a rebase and an initial review. I plan to address the
problems myself, unless Oskari wants to do that in which case I'll
happily review and hopefully eventually commit it. It's possible I
introduced new bugs while rebasing since basically everything moved
around, but it passes check-world here with and without HAVE_PREAD and
HAVE_PWRITE defined.
Let me summarise what's going on in the patch:
1. FileRead() and FileWrite() are replaced with FileReadAt() and
FileWriteAt(), taking a new offset argument. Now we can do one
syscall instead of two for random reads and writes.
2. fd.c no longer tracks seek position, so we lose a bunch of cruft
from the vfd machinery. FileSeek() now only supports SEEK_SET and
SEEK_END.
This is taking the position that we no longer want to be able to do
file IO with implicit positioning at all. I think that seems
reasonable: it's nice to drop all that position tracking and caching
code, and the seek-based approach would be incompatible with any
multithreading plans. I'm not even sure I'd bother adding "At" to the
function names. If there are any extensions that want the old
interface they will fail to compile either way. Note that the BufFile
interface remains the same, so hopefully that covers many use cases.
I guess the only remaining reason to use FileSeek() is to get the file
size? So I wonder why SEEK_SET remains valid in the patch... if my
suspicion is correct that only SEEK_END still has a reason to exist,
perhaps we should just kill FileSeek() and add FileSize() or something
instead?
pgstat_report_wait_start(wait_event_info);
+#ifdef HAVE_PREAD
+ returnCode = pread(vfdP->fd, buffer, amount, offset);
+#else
+ lseek(VfdCache[file].fd, offset, SEEK_SET);
returnCode = read(vfdP->fd, buffer, amount);
+#endif
pgstat_report_wait_end();
This obviously lacks error handling for lseek().
I wonder if anyone would want separate wait_event tracking for the
lseek() (though this codepath would be used by almost nobody,
especially if someone adds Windows support, so it's probably not worth
bothering with).
I suppose we could assume that if you have pread() you must also have
pwrite() and save on ./configure cycles.
I will add this to the next Festival of Commits, though clearly it
needs a bit more work before the festivities begin.
[1]: /messages/by-id/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7@ohmu.fi
[2]: /messages/by-id/b8748d39-0b19-0514-a1b9-4e5a28e6a208@gmail.com
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v1.patchapplication/octet-stream; name=0001-Use-pread-pwrite-instead-of-lseek-read-write-v1.patchDownload+73-258
On Thu, Jul 12, 2018 at 1:55 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I guess the only remaining reason to use FileSeek() is to get the file
size? So I wonder why SEEK_SET remains valid in the patch... if my
suspicion is correct that only SEEK_END still has a reason to exist,
perhaps we should just kill FileSeek() and add FileSize() or something
instead?
Done.
pgstat_report_wait_start(wait_event_info); +#ifdef HAVE_PREAD + returnCode = pread(vfdP->fd, buffer, amount, offset); +#else + lseek(VfdCache[file].fd, offset, SEEK_SET); returnCode = read(vfdP->fd, buffer, amount); +#endif pgstat_report_wait_end();This obviously lacks error handling for lseek().
Fixed.
Updated the main WAL IO routines to use pread()/pwrite() too.
Not super heavily tested yet.
An idea for how to handle Windows, in a follow-up patch: add a file
src/backend/port/win32/file.c that defines pgwin32_pread() and
pgwin32_pwrite(), wrapping WriteFile()/ReadFile() and passing in an
"OVERLAPPED" struct with the offset and sets errno on error, then set
up the macros so that Windows can use them as pread(), pwrite(). It
might also be necessary to open all files with FILE_FLAG_OVERLAPPED.
Does any Windows hacker have a bettter idea, and/or want to try to
write that patch? Otherwise I'll eventually try to do some long
distance hacking on AppVeyor.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v2.patchapplication/octet-stream; name=0001-Use-pread-pwrite-instead-of-lseek-read-write-v2.patchDownload+82-274
On 20/07/18 01:50, Thomas Munro wrote:
An idea for how to handle Windows, in a follow-up patch: add a file
src/backend/port/win32/file.c that defines pgwin32_pread() and
pgwin32_pwrite(), wrapping WriteFile()/ReadFile() and passing in an
"OVERLAPPED" struct with the offset and sets errno on error, then set
up the macros so that Windows can use them as pread(), pwrite(). It
might also be necessary to open all files with FILE_FLAG_OVERLAPPED.
Does any Windows hacker have a bettter idea, and/or want to try to
write that patch? Otherwise I'll eventually try to do some long
distance hacking on AppVeyor.
No objections, if you want to make the effort. But IMHO the lseek+read
fallback is good enough on Windows. Unless you were thinking that we
could then remove the !HAVE_PREAD fallback altogether. Are there any
other platforms out there that don't have pread/pwrite that we care about?
- Heikki
On Thu, Jul 12, 2018 at 01:55:31PM +1200, Thomas Munro wrote:
A couple of years ago, Oskari Saarenmaa proposed a patch[1] to adopt
$SUBJECT. Last year, Tobias Oberstein argued again that we should do
that[2]. At the end of that thread there was a +1 from multiple
committers in support of getting it done for PostgreSQL 12. Since
Oskari hasn't returned, I decided to dust off his patch and start a
new thread.
Thanks for picking this up - I was meaning to get back to this, but have
unfortunately been too busy with other projects.
1. FileRead() and FileWrite() are replaced with FileReadAt() and
FileWriteAt(), taking a new offset argument. Now we can do one
syscall instead of two for random reads and writes.
[...] I'm not even sure I'd bother adding "At" to the
function names. If there are any extensions that want the old
interface they will fail to compile either way. Note that the BufFile
interface remains the same, so hopefully that covers many use cases.
IIRC I used the "At" suffixes in my first version of the patch before
completely removing the functions which didn't take an offset argument
Now that they're gone I agree that we could just drop the "At" suffix;
"at" suffix is also used by various POSIX functions to operate in a
specific directory which may just add to confusion.
I guess the only remaining reason to use FileSeek() is to get the file
size? So I wonder why SEEK_SET remains valid in the patch... if my
suspicion is correct that only SEEK_END still has a reason to exist,
perhaps we should just kill FileSeek() and add FileSize() or something
instead?
I see you did this in your updated patch :+1:
Happy to see this patch move forward.
/ Oskari
Heikki Linnakangas <hlinnaka@iki.fi> writes:
No objections, if you want to make the effort. But IMHO the lseek+read
fallback is good enough on Windows. Unless you were thinking that we
could then remove the !HAVE_PREAD fallback altogether. Are there any
other platforms out there that don't have pread/pwrite that we care about?
AFAICT, macOS has them as far back as we care about (prairiedog does).
HPUX 10.20 (gaur/pademelon) does not, so personally I'd like to keep
the lseek+read workaround. Don't know about the oldest Solaris critters
we have in the buildfarm. FreeBSD has had 'em at least since 4.0 (1994);
didn't check the other BSDen.
SUS v2 (POSIX 1997) does specify both functions, so we could insist on
their presence without breaking any of our own portability guidelines.
However, if we have to have some workaround anyway for Windows, it
seems like including an lseek+read code path is reasonable so that we
needn't retire those oldest buildfarm critters.
regards, tom lane
On 20 Jul 2018, at 17:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
No objections, if you want to make the effort. But IMHO the lseek+read
fallback is good enough on Windows. Unless you were thinking that we
could then remove the !HAVE_PREAD fallback altogether. Are there any
other platforms out there that don't have pread/pwrite that we care about?AFAICT, macOS has them as far back as we care about (prairiedog does).
HPUX 10.20 (gaur/pademelon) does not, so personally I'd like to keep
the lseek+read workaround. Don't know about the oldest Solaris critters
we have in the buildfarm. FreeBSD has had 'em at least since 4.0 (1994);
didn't check the other BSDen.
The OpenBSD box I have access to has pwrite/pread, and have had for some time
if I read the manpage right.
cheers ./daniel
On Fri, Jul 20, 2018 at 8:14 PM, Oskari Saarenmaa <os@ohmu.fi> wrote:
On Thu, Jul 12, 2018 at 01:55:31PM +1200, Thomas Munro wrote:
[...] I'm not even sure I'd bother adding "At" to the
function names. If there are any extensions that want the old
interface they will fail to compile either way. Note that the BufFile
interface remains the same, so hopefully that covers many use cases.IIRC I used the "At" suffixes in my first version of the patch before
completely removing the functions which didn't take an offset argument
Now that they're gone I agree that we could just drop the "At" suffix;
"at" suffix is also used by various POSIX functions to operate in a
specific directory which may just add to confusion.
Done. Rebased.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v3.patchapplication/octet-stream; name=0001-Use-pread-pwrite-instead-of-lseek-read-write-v3.patchDownload+72-264
On Sat, Jul 21, 2018 at 3:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
No objections, if you want to make the effort. But IMHO the lseek+read
fallback is good enough on Windows. Unless you were thinking that we
could then remove the !HAVE_PREAD fallback altogether. Are there any
other platforms out there that don't have pread/pwrite that we care about?AFAICT, macOS has them as far back as we care about (prairiedog does).
HPUX 10.20 (gaur/pademelon) does not, so personally I'd like to keep
the lseek+read workaround. Don't know about the oldest Solaris critters
we have in the buildfarm. FreeBSD has had 'em at least since 4.0 (1994);
didn't check the other BSDen.SUS v2 (POSIX 1997) does specify both functions, so we could insist on
their presence without breaking any of our own portability guidelines.
However, if we have to have some workaround anyway for Windows, it
seems like including an lseek+read code path is reasonable so that we
needn't retire those oldest buildfarm critters.
Yeah it seems useful and cheap to carry the lseek() fallback. But
actually there is a good reason to implement proper pread/pwrite
(equivalent) on Windows: this patch removes the position tracking, so
that the fallback code generates *more* lseek() calls than current
master. For example with sequential reads today we are smart enough
to skip redundant lseek() calls, but this patch removes those smarts.
I doubt anyone cares about that on HPUX 10.20 but I don't think we
should do that on Windows.
--
Thomas Munro
http://www.enterprisedb.com
Hi,
On 07/26/2018 10:04 PM, Thomas Munro wrote:
Done. Rebased.
This needs a rebase again.
Once resolved the patch passes make check-world, and a strace analysis
shows the associated read()/write() have been turned into
pread64()/pwrite64(). All lseek()'s are SEEK_END's.
Best regards,
Jesper
Hi,
On 09/05/2018 02:42 PM, Jesper Pedersen wrote:
On 07/26/2018 10:04 PM, Thomas Munro wrote:
Done. Rebased.
This needs a rebase again.
Would it be of benefit to update these call sites
* slru.c
- SlruPhysicalReadPage
- SlruPhysicalWritePage
* xlogutils.c
- XLogRead
* pg_receivewal.c
- FindStreamingStart
* rewriteheap.c
- heap_xlog_logical_rewrite
* walreceiver.c
- XLogWalRcvWrite
too ?
Best regards,
Jesper
On Fri, Sep 7, 2018 at 2:17 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
This needs a rebase again.
Done.
Would it be of benefit to update these call sites
* slru.c
- SlruPhysicalReadPage
- SlruPhysicalWritePage
* xlogutils.c
- XLogRead
* pg_receivewal.c
- FindStreamingStart
* rewriteheap.c
- heap_xlog_logical_rewrite
* walreceiver.c
- XLogWalRcvWrite
It certainly wouldn't hurt... but more pressing to get this committed
would be Windows support IMHO. I think the thing to do is to open
files with the FILE_FLAG_OVERLAPPED flag, and then use ReadFile() and
WriteFile() with an LPOVERLAPPED struct that holds an offset, but I'm
not sure if I can write that myself. I tried doing some semi-serious
Windows development for the fsyncgate patch using only AppVeyor CI a
couple of weeks ago and it was like visiting the dentist.
On Thu, Sep 6, 2018 at 6:42 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
Once resolved the patch passes make check-world, and a strace analysis
shows the associated read()/write() have been turned into
pread64()/pwrite64(). All lseek()'s are SEEK_END's.
Yeah :-) Just for fun, here is the truss output for a single pgbench
transaction here:
recvfrom(9,"B\0\0\0\^R\0P0_4\0\0\0\0\0\0\^A"...,8192,0,NULL,0x0) = 41 (0x29)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\nBEG"...,27,0,NULL,0) = 27 (0x1b)
recvfrom(9,"B\0\0\0&\0P0_5\0\0\0\0\^B\0\0\0"...,8192,0,NULL,0x0) = 61 (0x3d)
pread(22,"\0\0\0\0\^P\M^C\M-@P\0\0\0\0\M-X"...,8192,0x960a000) = 8192 (0x2000)
pread(20,"\0\0\0\0\M^X\^D\M^Iq\0\0\0\0\^T"...,8192,0x380fe000) = 8192 (0x2000)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\rUPD"...,30,0,NULL,0) = 30 (0x1e)
recvfrom(9,"B\0\0\0\^]\0P0_6\0\0\0\0\^A\0\0"...,8192,0,NULL,0x0) = 52 (0x34)
sendto(9,"2\0\0\0\^DT\0\0\0!\0\^Aabalance"...,75,0,NULL,0) = 75 (0x4b)
recvfrom(9,"B\0\0\0"\0P0_7\0\0\0\0\^B\0\0\0"...,8192,0,NULL,0x0) = 57 (0x39)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\rUPD"...,30,0,NULL,0) = 30 (0x1e)
recvfrom(9,"B\0\0\0!\0P0_8\0\0\0\0\^B\0\0\0"...,8192,0,NULL,0x0) = 56 (0x38)
lseek(29,0x0,SEEK_END) = 8192 (0x2000)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\rUPD"...,30,0,NULL,0) = 30 (0x1e)
recvfrom(9,"B\0\0\0003\0P0_9\0\0\0\0\^D\0\0"...,8192,0,NULL,0x0) = 74 (0x4a)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\^OIN"...,32,0,NULL,0) = 32 (0x20)
recvfrom(9,"B\0\0\0\^S\0P0_10\0\0\0\0\0\0\^A"...,8192,0,NULL,0x0) = 42 (0x2a)
pwrite(33,"\M^X\M-P\^E\0\^A\0\0\0\0\M-`\M^A"...,16384,0x81e000) = 16384 (0x4000)
fdatasync(0x21) = 0 (0x0)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\vCOM"...,28,0,NULL,0) = 28 (0x1c)
There is only one lseek() left. I actually have another patch that
gets rid of even that (by caching sizes in SMgrRelation using a shared
invalidation counter which I'm not yet sure about). Then pgbench's
7-round-trip transaction makes only the strictly necessary 18 syscalls
(every one an explainable network message, disk page or sync).
Unpatched master has 5 extra lseek()s.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v4.patchapplication/octet-stream; name=0001-Use-pread-pwrite-instead-of-lseek-read-write-v4.patchDownload+72-264
On Wed, Sep 19, 2018 at 1:48 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Fri, Sep 7, 2018 at 2:17 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:This needs a rebase again.
And again, due to the conflict with ppoll in AC_CHECK_FUNCS.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v5.patchapplication/octet-stream; name=0001-Use-pread-pwrite-instead-of-lseek-read-write-v5.patchDownload+72-264
Hi Thomas,
On 9/18/18 9:48 PM, Thomas Munro wrote:
It certainly wouldn't hurt... but more pressing to get this committed
would be Windows support IMHO. I think the thing to do is to open
files with the FILE_FLAG_OVERLAPPED flag, and then use ReadFile() and
WriteFile() with an LPOVERLAPPED struct that holds an offset, but I'm
not sure if I can write that myself. I tried doing some semi-serious
Windows development for the fsyncgate patch using only AppVeyor CI a
couple of weeks ago and it was like visiting the dentist.
Sorry, no idea about this. Maybe Magnus can provide some feedback ?
On Thu, Sep 6, 2018 at 6:42 AM Jesper Pedersen
Once resolved the patch passes make check-world, and a strace analysis
shows the associated read()/write() have been turned into
pread64()/pwrite64(). All lseek()'s are SEEK_END's.Yeah :-)
Thanks for v5 too.
Best regards,
Jesper
On Fri, Sep 28, 2018 at 2:03 AM Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
Thanks for v5 too.
Rebased again. Patches that touch AC_CHECK_FUNCS are fun like that!
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Use-pread-pwrite-instead-of-lseek-read-write-v6.patchapplication/octet-stream; name=0001-Use-pread-pwrite-instead-of-lseek-read-write-v6.patchDownload+72-264
Thomas Munro <thomas.munro@enterprisedb.com> writes:
Rebased again. Patches that touch AC_CHECK_FUNCS are fun like that!
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:
AC_CHECK_FUNCS([
cbrt
clock_gettime
fdatasync
...
wcstombs_l
])
You'd still get conflicts in configure itself, of course, but that
doesn't require manual work to resolve -- just re-run autoconf.
regards, tom lane
On Tue, Oct 9, 2018 at 2:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
Rebased again. Patches that touch AC_CHECK_FUNCS are fun like that!
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:AC_CHECK_FUNCS([
cbrt
clock_gettime
fdatasync
...
wcstombs_l
])You'd still get conflicts in configure itself, of course, but that
doesn't require manual work to resolve -- just re-run autoconf.
+1, was about to suggest the same!
--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes:
On Tue, Oct 9, 2018 at 2:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:AC_CHECK_FUNCS([
cbrt
clock_gettime
fdatasync
...
wcstombs_l
])You'd still get conflicts in configure itself, of course, but that
doesn't require manual work to resolve -- just re-run autoconf.
+1, was about to suggest the same!
Sold, I'll go do it.
regards, tom lane
I wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
On Tue, Oct 9, 2018 at 2:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:
+1, was about to suggest the same!
Sold, I'll go do it.
Learned a few new things about M4 along the way :-( ... but done.
You'll need to rebase the pread patch again of course.
regards, tom lane
On 10/08/2018 09:55 PM, Tom Lane wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
Rebased again. Patches that touch AC_CHECK_FUNCS are fun like that!
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:AC_CHECK_FUNCS([
cbrt
clock_gettime
fdatasync
...
wcstombs_l
])You'd still get conflicts in configure itself, of course, but that
doesn't require manual work to resolve -- just re-run autoconf.
By and large I think it's better not to submit patches with changes to
configure, but to let the committer run autoconf.
You can avoid getting such changes in your patches by doing something
like this:
git config diff.nodiff.command /bin/true
echo configure diff=nodiff >> .git/info/attributes
If you actually want to turn this off and see any diffs in configure, run
git diff --no-ext-diff
It's also possible to supply a filter expression to 'git diff'.
OTOH, this will probably confuse the heck out of the cfbot patch checker.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-10-09 14:32:29 -0400, Andrew Dunstan wrote:
On 10/08/2018 09:55 PM, Tom Lane wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
Rebased again. Patches that touch AC_CHECK_FUNCS are fun like that!
Yeah, I've been burnt by that too recently. It occurs to me we could make
that at least a little less painful if we formatted the macro with one
line per function name:AC_CHECK_FUNCS([
cbrt
clock_gettime
fdatasync
...
wcstombs_l
])You'd still get conflicts in configure itself, of course, but that
doesn't require manual work to resolve -- just re-run autoconf.By and large I think it's better not to submit patches with changes to
configure, but to let the committer run autoconf.
OTOH, this will probably confuse the heck out of the cfbot patch checker.
And make life harder for reviewers.
-1 on this one.
Greetings,
Andres Freund