Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

Started by Ranier Vilelaover 5 years ago12 messageshackers
Jump to latest
#1Ranier Vilela
ranier.vf@gmail.com

Hi,

read_binary_file seems a bit complicated when we want to read the rest of
the file (-1 for bytes_to_read).
This version, it seems, has no opposite effects, someone can confirm?

regards,
Ranier Vilela

Attachments:

v1-0001-simplified_read_binary_file.patchapplication/octet-stream; name=v1-0001-simplified_read_binary_file.patchDownload+19-67
#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Ranier Vilela (#1)
Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

On 2020-Sep-11, Ranier Vilela wrote:

Hi,

read_binary_file seems a bit complicated when we want to read the rest of
the file (-1 for bytes_to_read).

This code was very recently rewritten in 96d1f423f95d, and I doubt that
taking out half the algorithm without studying how it got that way is a
great idea.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Ranier Vilela
ranier.vf@gmail.com
In reply to: Alvaro Herrera (#2)
Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

Em sex., 11 de set. de 2020 às 14:01, Alvaro Herrera <
alvherre@2ndquadrant.com> escreveu:

On 2020-Sep-11, Ranier Vilela wrote:

Hi,

read_binary_file seems a bit complicated when we want to read the rest of
the file (-1 for bytes_to_read).

This code was very recently rewritten in 96d1f423f95d, and I doubt that
taking out half the algorithm without studying how it got that way is a
great idea.

Works with all regress tests (199, same as the HEAD with msvc 2019 64 bits
except partition_prune)
Works with pgbench (pgbench -U postgres -c 50 -j 2 -t 10000 example)
And works with local installation, without adverse effects, for now.

regards,
Ranier Vilela

#4Andres Freund
andres@anarazel.de
In reply to: Ranier Vilela (#3)
Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

On 2020-09-11 14:10:31 -0300, Ranier Vilela wrote:

Em sex., 11 de set. de 2020 �s 14:01, Alvaro Herrera <
alvherre@2ndquadrant.com> escreveu:

On 2020-Sep-11, Ranier Vilela wrote:
This code was very recently rewritten in 96d1f423f95d, and I doubt that
taking out half the algorithm without studying how it got that way is a
great idea.

Works with all regress tests (199, same as the HEAD with msvc 2019 64 bits
except partition_prune)
Works with pgbench (pgbench -U postgres -c 50 -j 2 -t 10000 example)
And works with local installation, without adverse effects, for now.

Have your read the commit message for 96d1f423f95d ?

#5Ranier Vilela
ranier.vf@gmail.com
In reply to: Andres Freund (#4)
Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

Em sex., 11 de set. de 2020 às 15:09, Andres Freund <andres@anarazel.de>
escreveu:

On 2020-09-11 14:10:31 -0300, Ranier Vilela wrote:

Em sex., 11 de set. de 2020 às 14:01, Alvaro Herrera <
alvherre@2ndquadrant.com> escreveu:

On 2020-Sep-11, Ranier Vilela wrote:
This code was very recently rewritten in 96d1f423f95d, and I doubt that
taking out half the algorithm without studying how it got that way is a
great idea.

Works with all regress tests (199, same as the HEAD with msvc 2019 64

bits

except partition_prune)
Works with pgbench (pgbench -U postgres -c 50 -j 2 -t 10000 example)
And works with local installation, without adverse effects, for now.

Have your read the commit message for 96d1f423f95d ?

Yead, I read.
He's concerned about virtual file (pipe, FIFO, socket).
The patch ( 96d1f423f95d ) has the same problem.
fseeko with virtual file can fail too.

https://man7.org/linux/man-pages/man3/fseek.3.html
ESPIPE The file descriptor underlying stream is not seekable (e.g.,
it refers to a pipe, FIFO, or socket)

Call read_binary_file with virtual file,will log in:
"could not seek in file"

regards,
Ranier Vilela

#6Ranier Vilela
ranier.vf@gmail.com
In reply to: Ranier Vilela (#5)
Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

New version, with support to read Virtual File (pipe, FIFO and socket).
With assert, in case, erroneous, of trying to read a pipe, with offset.

regards,
Ranier Vilela

Attachments:

v2-0001-simplified_read_binary_file.patchapplication/octet-stream; name=v2-0001-simplified_read_binary_file.patchDownload+34-53
#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ranier Vilela (#6)
Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

Ranier Vilela <ranier.vf@gmail.com> writes:

New version, with support to read Virtual File (pipe, FIFO and socket).
With assert, in case, erroneous, of trying to read a pipe, with offset.

Really, could you do a little more thinking and testing of your own,
rather than expecting the rest of us to point out holes in your thinking?

* bytes_to_read == 0 is a legal case.

* "Assert(seek_offset != 0)" is an equally awful idea.

* Removing the seek code from the negative-bytes_to_read path
is just broken.

* The only reason this code is shorter than the previous version is
you took out all the comments explaining what it was doing, and
failed to replace them. This is just as subtle as before, so I
don't find that acceptable.

I do agree that it might be worth skipping the fseeko call in the
probably-common case where seek_offset is zero. Otherwise I don't
see much reason to change what's there.

regards, tom lane

#8Ranier Vilela
ranier.vf@gmail.com
In reply to: Tom Lane (#7)
Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

Em sex., 11 de set. de 2020 às 17:44, Tom Lane <tgl@sss.pgh.pa.us> escreveu:

Ranier Vilela <ranier.vf@gmail.com> writes:

New version, with support to read Virtual File (pipe, FIFO and socket).
With assert, in case, erroneous, of trying to read a pipe, with offset.

Really, could you do a little more thinking and testing of your own,
rather than expecting the rest of us to point out holes in your thinking?

Yes, I can.

* bytes_to_read == 0 is a legal case.

Ok. Strange call to read a file with zero bytes.

* "Assert(seek_offset != 0)" is an equally awful idea.

I was trying to predict the case of reading a Virtual File, with
bytes_to_read == -1 and seek_offset == 0,
which is bound to fail in fseeko.
In any case, 96d1f423f95d it will fail with any Virtual File.

* Removing the seek code from the negative-bytes_to_read path
is just broken.

Ok.

* The only reason this code is shorter than the previous version is
you took out all the comments explaining what it was doing, and
failed to replace them. This is just as subtle as before, so I
don't find that acceptable.

It was not my intention.

I do agree that it might be worth skipping the fseeko call in the
probably-common case where seek_offset is zero. Otherwise I don't
see much reason to change what's there.

Well, I think that v3 is a little better, but it’s just my opinion.
v3 try to copy directly in the final buffer, which will certainly be a
little faster.

regards,
Ranier Vilela

Attachments:

v3-0001-simplified_read_binary_file.patchapplication/octet-stream; name=v3-0001-simplified_read_binary_file.patchDownload+33-39
#9Ranier Vilela
ranier.vf@gmail.com
In reply to: Tom Lane (#7)
Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

Em sex., 11 de set. de 2020 às 17:44, Tom Lane <tgl@sss.pgh.pa.us> escreveu:

Ranier Vilela <ranier.vf@gmail.com> writes:

New version, with support to read Virtual File (pipe, FIFO and socket).
With assert, in case, erroneous, of trying to read a pipe, with offset.

I do agree that it might be worth skipping the fseeko call in the
probably-common case where seek_offset is zero. Otherwise I don't
see much reason to change what's there.

Tom, if you think it's worth it, I agree with only avoiding syscal fseeko.

regards,
Ranier Vilela

Attachments:

v1-0001-avoid_syscall_fseeko_read_binary_file.patchapplication/octet-stream; name=v1-0001-avoid_syscall_fseeko_read_binary_file.patchDownload+4-2
#10Ranier Vilela
ranier.vf@gmail.com
In reply to: Ranier Vilela (#8)
Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

Em sex., 11 de set. de 2020 às 18:43, Ranier Vilela <ranier.vf@gmail.com>
escreveu:

Em sex., 11 de set. de 2020 às 17:44, Tom Lane <tgl@sss.pgh.pa.us>
escreveu:

Ranier Vilela <ranier.vf@gmail.com> writes:

New version, with support to read Virtual File (pipe, FIFO and socket).
With assert, in case, erroneous, of trying to read a pipe, with offset.

Really, could you do a little more thinking and testing of your own,
rather than expecting the rest of us to point out holes in your thinking?

Yes, I can.

* bytes_to_read == 0 is a legal case.

Ok. Strange call to read a file with zero bytes.

* "Assert(seek_offset != 0)" is an equally awful idea.

I was trying to predict the case of reading a Virtual File, with
bytes_to_read == -1 and seek_offset == 0,
which is bound to fail in fseeko.
In any case, 96d1f423f95d it will fail with any Virtual File.

* Removing the seek code from the negative-bytes_to_read path
is just broken.

Ok.

* The only reason this code is shorter than the previous version is
you took out all the comments explaining what it was doing, and
failed to replace them. This is just as subtle as before, so I
don't find that acceptable.

It was not my intention.

I do agree that it might be worth skipping the fseeko call in the
probably-common case where seek_offset is zero. Otherwise I don't
see much reason to change what's there.

Well, I think that v3 is a little better, but it’s just my opinion.
v3 try to copy directly in the final buffer, which will certainly be a
little faster.

v4 patch, simple benchmark, read binary file with size > 6GB.

PostgreSQL main:
postgres=# \timing on
Timing is on.
postgres=# select
pg_read_file('c:\users\ranier\downloads\macOS_Catalina.7z');
ERROR: file length too large
Time: 3068,459 ms (00:03,068)
postgres=#

PostgreSQL patched (v4 read_binary_file):
postgres=# \timing on
Timing is on.
postgres=# select
pg_read_file('c:\users\ranier\downloads\macOS_Catalina.7z');
ERROR: file length too large
Time: 701,035 ms
postgres=#

4.37x faster, very good.

v4 handles pipes very well.
Tested with
https://docs.microsoft.com/en-us/windows/win32/ipc/multithreaded-pipe-server

Terminal one:
C:\usr\src\tests\pipes>pipe_server

Pipe Server: Main thread awaiting client connection on \\.\pipe\mynamedpipe

Terminal two:
postgres=# \timing on
Timing is on.
postgres=# select pg_read_file('\\.\pipe\mynamedpipe');

Terminal one:
Client connected, creating a processing thread.

Pipe Server: Main thread awaiting client connection on \\.\pipe\mynamedpipe
InstanceThread created, receiving and processing messages.

Pipe Server: Main thread awaiting client connection on \\.\pipe\mynamedpipe
InstanceThread created, receiving and processing messages.
^C
C:\usr\src\tests\pipes>

Terminal two:
postgres=# select pg_read_file('\\.\pipe\mynamedpipe');
pg_read_file
--------------

(1 row)

Time: 77267,481 ms (01:17,267)
postgres=#

regards,
Ranier Vilela

Attachments:

v4-0001-simplified_read_binary_file.patchapplication/octet-stream; name=v4-0001-simplified_read_binary_file.patchDownload+37-46
#11Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Ranier Vilela (#10)
Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

I think you meant _IONBF instead of _IOFBF -- otherwise it's at odds
with the comment you add. But what is the justification for that
addition? I don't see us doing that anywhere else.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12Ranier Vilela
ranier.vf@gmail.com
In reply to: Alvaro Herrera (#11)
Re: Simplified version of read_binary_file (src/backend/utils/adt/genfile.c)

Em ter., 15 de set. de 2020 às 14:54, Alvaro Herrera <
alvherre@2ndquadrant.com> escreveu:

I think you meant _IONBF instead of _IOFBF -- otherwise it's at odds
with the comment you add. But what is the justification for that
addition? I don't see us doing that anywhere else.

No.
_IOFBF *Full buffering:* On output, data is written once the buffer is full
(or flushed <http://www.cplusplus.com/fflush&gt;). On Input, the buffer is
filled when an input operation is requested and the buffer is empty.
* _IONBF No buffering:* No buffer is used. Each I/O operation is written as
soon as possible. In this case, the *buffer* and *size* parameters are
ignored.
_IONBF ignores buffer and size.

Without setvbuf, fread uses an internal buffer, default 4096 bytes (OS
dependent).
If fread uses an internal buffer, then it needs a copy to the buffer
provided by the function.
setvbuf, does the same as read function low level, copies directly to the
final buffer.

regards,
Ranier Vilela