Flushing large data immediately in pqcomm

Started by Melih Mutlu about 2 years ago · 53 messages
#1Melih Mutlu
m.melihmutlu@gmail.com
1 attachment(s)

Hi hackers

I've been looking into ways to reduce the overhead we're having in pqcomm
and I'd like to propose a small patch to modify how socket_putmessage works.

Currently socket_putmessage copies any input data into the pqcomm send
buffer (PqSendBuffer) and the size of this buffer is 8K. When the send
buffer gets full, it's flushed and we continue to copy more data into the
send buffer until we have no data left to be sent.
Since the send buffer is flushed whenever it's full, I think we are safe to
say that if the size of the input data is larger than the buffer size, which
is 8K, then the buffer will be flushed at least once (or more, depending on
the input size) to store all the input data.

The proposed change modifies socket_putmessage to send any data larger than
8K immediately without copying it into the send buffer. Assuming that the
send buffer would be flushed anyway due to reaching its limit, the patch
just gets rid of the copy part which seems unnecessary and sends data
without waiting.
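
In rough terms, the core of the attached patch boils down to this (a
simplified sketch of the new path in socket_putmessage, not the exact code):

	/* in socket_putmessage(), after writing msgtype and the length word */
	if (len >= PqSendBufferSize)
	{
		int		start = 0;
		int		end = len;

		socket_set_nonblocking(false);

		/* flush whatever is already buffered (at least the 5-byte header) */
		if (internal_flush())
			goto fail;

		/* then send the payload directly, bypassing PqSendBuffer */
		if (internal_flush_buffer(s, &start, &end))
			goto fail;
	}
	else
	{
		/* small payloads keep going through the 8K buffer as before */
		if (internal_putbytes(s, len))
			goto fail;
	}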

This change affects places where pq_putmessage is used such as
pg_basebackup, COPY TO, walsender etc.

I did some experiments to see how the patch performs.
Firstly, I loaded ~5GB data into a table [1], then ran "COPY test TO
STDOUT". Here are perf results of both the patch and HEAD

HEAD:
-   94,13%     0,22%  postgres  postgres           [.] DoCopyTo
  - 93,90% DoCopyTo
      - 91,80% CopyOneRowTo
         + 47,35% CopyAttributeOutText
         - 26,49% CopySendEndOfRow
            - 25,97% socket_putmessage
               - internal_putbytes
                  - 24,38% internal_flush
                     + secure_write
                  + 1,47% memcpy (inlined)
         + 14,69% FunctionCall1Coll
         + 1,94% appendBinaryStringInfo
         + 0,75% MemoryContextResetOnly
      + 1,54% table_scan_getnextslot (inlined)
Patch:
-   94,40%     0,30%  postgres  postgres           [.] DoCopyTo
   - 94,11% DoCopyTo
      - 92,41% CopyOneRowTo
         + 51,20% CopyAttributeOutText
         - 20,87% CopySendEndOfRow
            - 20,45% socket_putmessage
               - internal_putbytes
                  - 18,50% internal_flush (inlined)
                       internal_flush_buffer
                     + secure_write
                  + 1,61% memcpy (inlined)
         + 17,36% FunctionCall1Coll
         + 1,33% appendBinaryStringInfo
         + 0,93% MemoryContextResetOnly
      + 1,36% table_scan_getnextslot (inlined)

The patch brings a ~5% gain in socket_putmessage.

Also timed the pg_basebackup like:
time pg_basebackup -p 5432 -U replica_user -X none -c fast --no-manifest
-D test

HEAD:
real 0m10,040s
user 0m0,768s
sys 0m7,758s

Patch:
real 0m8,882s
user 0m0,699s
sys 0m6,980s

It seems ~11% faster in this specific case.

I'd appreciate any feedback/thoughts.

[1]:
CREATE TABLE test(id int, name text, time TIMESTAMP);
INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 100) AS
name, NOW() AS time FROM generate_series(1, 100000000) AS i;

Thanks,
--
Melih Mutlu
Microsoft

Attachments:

0001-Flush-large-data-immediately-in-pqcomm.patch (application/octet-stream)
From 6427010b04e4dc2becb9d79be6cdf07804e74e34 Mon Sep 17 00:00:00 2001
From: Melih Mutlu <m.melihmutlu@gmail.com>
Date: Mon, 20 Nov 2023 11:20:52 +0300
Subject: [PATCH] Flush large data immediately in pqcomm

If the data is larger than send buffer size in pqcomm, we're sure that
the send buffer will be flushed at least once to fit the data depending
on how large the data is.

Instead of memcpy'ing and then flushing data larger than 8K, this patch
changes socket_putmessage logic to flush large data immediately without
unnecessarily copying it to the pqcomm send buffer.
---
 src/backend/libpq/pqcomm.c | 48 ++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 7 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 522584e597..b1c7221a18 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -146,6 +146,7 @@ static int	socket_putmessage(char msgtype, const char *s, size_t len);
 static void socket_putmessage_noblock(char msgtype, const char *s, size_t len);
 static int	internal_putbytes(const char *s, size_t len);
 static int	internal_flush(void);
+static int	internal_flush_buffer(const char *s, int *start, int *end);
 
 static int	Lock_AF_UNIX(const char *unixSocketDir, const char *unixSocketPath);
 static int	Setup_AF_UNIX(const char *sock_path);
@@ -1333,11 +1334,25 @@ socket_flush(void)
  */
 static int
 internal_flush(void)
+{
+	/* flush the pending output from send buffer. */
+	return internal_flush_buffer(PqSendBuffer, &PqSendStart, &PqSendPointer);
+}
+
+/* --------------------------------
+ *		internal_flush_buffer - flush the given buffer content
+ *
+ * Returns 0 if OK (meaning everything was sent, or operation would block
+ * and the socket is in non-blocking mode), or EOF if trouble.
+ * --------------------------------
+ */
+static int
+internal_flush_buffer(const char *s, int *start, int *end)
 {
 	static int	last_reported_send_errno = 0;
 
-	char	   *bufptr = PqSendBuffer + PqSendStart;
-	char	   *bufend = PqSendBuffer + PqSendPointer;
+	char	   *bufptr = (char*) s + *start;
+	char	   *bufend = (char*) s + *end;
 
 	while (bufptr < bufend)
 	{
@@ -1383,7 +1398,7 @@ internal_flush(void)
 			 * flag that'll cause the next CHECK_FOR_INTERRUPTS to terminate
 			 * the connection.
 			 */
-			PqSendStart = PqSendPointer = 0;
+			*start = *end = 0;
 			ClientConnectionLost = 1;
 			InterruptPending = 1;
 			return EOF;
@@ -1391,10 +1406,10 @@ internal_flush(void)
 
 		last_reported_send_errno = 0;	/* reset after any successful send */
 		bufptr += r;
-		PqSendStart += r;
+		*start += r;
 	}
 
-	PqSendStart = PqSendPointer = 0;
+	*start = *end = 0;
 	return 0;
 }
 
@@ -1470,6 +1485,7 @@ socket_putmessage(char msgtype, const char *s, size_t len)
 	if (PqCommBusy)
 		return 0;
 	PqCommBusy = true;
+	
 	if (internal_putbytes(&msgtype, 1))
 		goto fail;
 
@@ -1477,8 +1493,26 @@ socket_putmessage(char msgtype, const char *s, size_t len)
 	if (internal_putbytes((char *) &n32, 4))
 		goto fail;
 
-	if (internal_putbytes(s, len))
-		goto fail;
+	if (len >= PqSendBufferSize)
+	{
+		int start = 0;
+		int end = len;
+
+		socket_set_nonblocking(false);
+		/* send pending data first */
+		if (internal_flush())
+			goto fail;
+
+		/* send the large buffer without copying it into PqSendBuffer */
+		if (internal_flush_buffer(s, &start, &end))
+			goto fail;
+	}
+	else
+	{
+		if (internal_putbytes(s, len))
+			goto fail;
+	}
+
 	PqCommBusy = false;
 	return 0;
 
-- 
2.34.1

#2Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Melih Mutlu (#1)
Re: Flushing large data immediately in pqcomm

On 20/11/2023 14:21, Melih Mutlu wrote:

Hi hackers

I've been looking into ways to reduce the overhead we're having in
pqcomm and I'd like to propose a small patch to modify how
socket_putmessage works.

Currently socket_putmessage copies any input data into the pqcomm send
buffer (PqSendBuffer) and the size of this buffer is 8K. When the send
buffer gets full, it's flushed and we continue to copy more data into
the send buffer until we have no data left to be sent.
Since the send buffer is flushed whenever it's full, I think we are safe
to say that if the size of input data is larger than the buffer size,
which is 8K, then the buffer will be flushed at least once (or more,
depending on the input size) to store all the input data.

Agreed, that's silly.

Proposed change modifies socket_putmessage to send any data larger than
8K immediately without copying it into the send buffer. Assuming that
the send buffer would be flushed anyway due to reaching its limit, the
patch just gets rid of the copy part which seems unnecessary and sends
data without waiting.

If there's already some data in PqSendBuffer, I wonder if it would be
better to fill it up with data, flush it, and then send the rest of the
data directly. Instead of flushing the partial data first. I'm afraid
that you'll make a tiny call to secure_write(), followed by a large one,
then a tiny one again, and so forth. Especially when socket_putmessage
itself writes the msgtype and len, which are tiny, before the payload.

Perhaps we should invent a new pq_putmessage() function that would take
an input buffer with 5 bytes of space reserved before the payload.
pq_putmessage() could then fill in the msgtype and len bytes in the
input buffer and send that directly. (Not wedded to that particular API,
but something that would have the same effect)
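
For illustration, such an API could look something like this (purely a
sketch; the function name and details are made up, and it reuses the
internal_flush_buffer() helper from the patch):

	/*
	 * Hypothetical: the caller reserves 5 spare bytes in front of the
	 * payload; we fill in the header there and push header + payload to
	 * the socket in one flush, so the header never goes out as a tiny
	 * separate write.
	 */
	static int
	pq_putmessage_reserved(char msgtype, char *buf, size_t payload_len)
	{
		uint32		n32 = pg_hton32((uint32) (payload_len + 4));
		int			start = 0;
		int			end = (int) (payload_len + 5);

		buf[0] = msgtype;
		memcpy(buf + 1, &n32, 4);

		return internal_flush_buffer(buf, &start, &end);
	}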

This change affects places where pq_putmessage is used such as
pg_basebackup, COPY TO, walsender etc.

I did some experiments to see how the patch performs.
Firstly, I loaded ~5GB data into a table [1], then ran "COPY test TO
STDOUT". Here are perf results of both the patch and HEAD > ...
The patch brings a ~5% gain in socket_putmessage.

[1]
CREATE TABLE test(id int, name text, time TIMESTAMP);
INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 100)
AS name, NOW() AS time FROM generate_series(1, 100000000) AS i;

I'm surprised by these results, because each row in that table is < 600
bytes. PqSendBufferSize is 8kB, so the optimization shouldn't kick in in
that test. Am I missing something?

--
Heikki Linnakangas
Neon (https://neon.tech)

#3Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: Flushing large data immediately in pqcomm

On Mon, Jan 29, 2024 at 11:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Agreed, that's silly.

+1.

If there's already some data in PqSendBuffer, I wonder if it would be
better to fill it up with data, flush it, and then send the rest of the
data directly. Instead of flushing the partial data first. I'm afraid
that you'll make a tiny call to secure_write(), followed by a large one,
then a tiny one again, and so forth. Especially when socket_putmessage
itself writes the msgtype and len, which are tiny, before the payload.

Perhaps we should invent a new pq_putmessage() function that would take
an input buffer with 5 bytes of space reserved before the payload.
pq_putmessage() could then fill in the msgtype and len bytes in the
input buffer and send that directly. (Not wedded to that particular API,
but something that would have the same effect)

I share the concern; I'm not sure about the best solution. I wonder if
it would be useful to have pq_putmessagev() in the style of writev()
et al. Or maybe what we need is secure_writev().
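
As a purely illustrative sketch (no such function exists in the tree;
the name, the fixed-size iovec array and the error handling are all
invented), a vectored send could look roughly like this, with the
understanding that a real version would go through an SSL/GSS-aware
secure_writev() and loop on partial writes:

	#include <arpa/inet.h>
	#include <stdint.h>
	#include <string.h>
	#include <sys/uio.h>

	static int
	putmessagev_sketch(int sock, char msgtype, const struct iovec *payload, int nchunks)
	{
		char		header[5];
		uint32_t	len = 4;		/* the length word counts itself */
		struct iovec iov[16];
		int			i;

		if (nchunks + 1 > 16)
			return -1;				/* keep the sketch simple */

		for (i = 0; i < nchunks; i++)
			len += payload[i].iov_len;

		header[0] = msgtype;
		len = htonl(len);
		memcpy(header + 1, &len, 4);

		iov[0].iov_base = header;
		iov[0].iov_len = 5;
		memcpy(&iov[1], payload, nchunks * sizeof(struct iovec));

		/* header and all payload chunks leave in a single syscall */
		return writev(sock, iov, nchunks + 1) < 0 ? -1 : 0;
	}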

I also wonder if the threshold for sending data directly should be
smaller than the buffer size, and/or whether it should depend on the
buffer being empty. If we have an 8kB buffer that currently has
nothing in it, and somebody writes 2kB, I suspect it might be wrong to
copy that into the buffer. If the same buffer had 5kB used and 3kB
free, copying sounds a lot more likely to work out. The goal here is
probably to condense sequences of short messages into a single
transmission while sending long messages individually. I'm just not
quite sure what heuristic would do that most effectively.

--
Robert Haas
EDB: http://www.enterprisedb.com

#4Melih Mutlu
m.melihmutlu@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: Flushing large data immediately in pqcomm

Hi Heikki,

Heikki Linnakangas <hlinnaka@iki.fi>, 29 Oca 2024 Pzt, 19:12 tarihinde şunu
yazdı:

Proposed change modifies socket_putmessage to send any data larger than
8K immediately without copying it into the send buffer. Assuming that
the send buffer would be flushed anyway due to reaching its limit, the
patch just gets rid of the copy part which seems unnecessary and sends
data without waiting.

If there's already some data in PqSendBuffer, I wonder if it would be
better to fill it up with data, flush it, and then send the rest of the
data directly. Instead of flushing the partial data first. I'm afraid
that you'll make a tiny call to secure_write(), followed by a large one,
then a tiny one again, and so forth. Especially when socket_putmessage
itself writes the msgtype and len, which are tiny, before the payload.

I agree that I could do better there without flushing twice for both
PqSendBuffer and
input data. PqSendBuffer always has some data, even if it's tiny, since
msgtype and len are added.

Perhaps we should invent a new pq_putmessage() function that would take
an input buffer with 5 bytes of space reserved before the payload.
pq_putmessage() could then fill in the msgtype and len bytes in the
input buffer and send that directly. (Not wedded to that particular API,
but something that would have the same effect)

I thought about doing this. The reason why I didn't was because I think
that such a change would require adjusting all input buffers wherever
pq_putmessage is called, and I did not want to touch that many different
places. These places where we need pq_putmessage might not be that many
though, I'm not sure.

This change affects places where pq_putmessage is used such as
pg_basebackup, COPY TO, walsender etc.

I did some experiments to see how the patch performs.
Firstly, I loaded ~5GB data into a table [1], then ran "COPY test TO
STDOUT". Here are perf results of both the patch and HEAD > ...
The patch brings a ~5% gain in socket_putmessage.

[1]
CREATE TABLE test(id int, name text, time TIMESTAMP);
INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 100)
AS name, NOW() AS time FROM generate_series(1, 100000000) AS i;

I'm surprised by these results, because each row in that table is < 600
bytes. PqSendBufferSize is 8kB, so the optimization shouldn't kick in in
that test. Am I missing something?

You're absolutely right. I made a silly mistake there. I also think that
the way I did perf analysis does not make much sense, even if one row of
the table is greater than 8kB.
Here are some quick timing results after being sure that it triggers this
patch's optimization. I need to think more on how to profile this with
perf. I hope to share proper results soon.

I just added a bit more zeros [1] and ran [2] (hopefully measured the
correct thing)

HEAD:
real 2m48,938s
user 0m9,226s
sys 1m35,342s

Patch:
real 2m40,690s
user 0m8,492s
sys 1m31,001s

[1]:
INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 10000)
AS name, NOW() AS time FROM generate_series(1, 1000000) AS i;

[2]:
rm /tmp/dummy && echo 3 | sudo tee /proc/sys/vm/drop_caches && time psql
-d postgres -c "COPY test TO STDOUT;" > /tmp/dummy

Thanks,
--
Melih Mutlu
Microsoft

#5Melih Mutlu
m.melihmutlu@gmail.com
In reply to: Robert Haas (#3)
Re: Flushing large data immediately in pqcomm

Hi Robert,

Robert Haas <robertmhaas@gmail.com>, 29 Oca 2024 Pzt, 20:48 tarihinde şunu
yazdı:

If there's already some data in PqSendBuffer, I wonder if it would be
better to fill it up with data, flush it, and then send the rest of the
data directly. Instead of flushing the partial data first. I'm afraid
that you'll make a tiny call to secure_write(), followed by a large one,
then a tiny one again, and so forth. Especially when socket_putmessage
itself writes the msgtype and len, which are tiny, before the payload.

Perhaps we should invent a new pq_putmessage() function that would take
an input buffer with 5 bytes of space reserved before the payload.
pq_putmessage() could then fill in the msgtype and len bytes in the
input buffer and send that directly. (Not wedded to that particular API,
but something that would have the same effect)

I share the concern; I'm not sure about the best solution. I wonder if
it would be useful to have pq_putmessagev() in the style of writev()
et al. Or maybe what we need is secure_writev().

I thought about using writev() for not only pq_putmessage() but
pq_putmessage_noblock() too. Currently, pq_putmessage_noblock() repallocs
PqSendBuffer and copies the input buffer, which can easily be larger than
8kB, into PqSendBuffer. I also discussed it with Thomas off-list. The thing
is that I believe we would need secure_writev() with SSL/GSS cases handled
properly. I'm just not sure if the effort would be worthwhile considering
what we gain from it.

I also wonder if the threshold for sending data directly should be
smaller than the buffer size, and/or whether it should depend on the
buffer being empty.

You might be right. I'm not sure what the ideal threshold would be.

If we have an 8kB buffer that currently has
nothing in it, and somebody writes 2kB, I suspect it might be wrong to
copy that into the buffer. If the same buffer had 5kB used and 3kB
free, copying sounds a lot more likely to work out. The goal here is
probably to condense sequences of short messages into a single
transmission while sending long messages individually. I'm just not
quite sure what heuristic would do that most effectively.

Sounds like it's difficult to come up with a heuristic that would work well
enough for most cases.
One thing about sending data directly (instead of copying it) whenever the
buffer is empty is that the buffer starts out empty; if we never copy
anything while it's empty, it will stay empty forever and every message,
however small, will be sent directly. We can maybe simply set the threshold
to half the buffer size (4kB) and hope that will work better. Or copy the
data only if it fits into the remaining space in the buffer. What do you
think?

An additional note while I mentioned pq_putmessage_noblock(): I've been
testing sending input data immediately in pq_putmessage_noblock() without
blocking and copying the data into PqSendBuffer only if the socket would
block and cannot send it. Unfortunately, I don't have strong numbers to
demonstrate any improvement in perf or timing yet. But I'd still like to
know what you think about it.

Thanks,
--
Melih Mutlu
Microsoft

#6Robert Haas
robertmhaas@gmail.com
In reply to: Melih Mutlu (#5)
Re: Flushing large data immediately in pqcomm

On Tue, Jan 30, 2024 at 12:58 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:

Sounds like it's difficult to come up with a heuristic that would work well enough for most cases.
One thing with sending data instead of copying it if the buffer is empty is that initially the buffer is empty. I believe it will stay empty forever if we do not copy anything when the buffer is empty. We can maybe simply set the threshold to the buffer size/2 (4kB) and hope that will work better. Or copy the data only if it fits into the remaining space in the buffer. What do you think?

An additional note while I mentioned pq_putmessage_noblock(), I've been testing sending input data immediately in pq_putmessage_noblock() without blocking and copy the data into PqSendBuffer only if the socket would block and cannot send it. Unfortunately, I don't have strong numbers to demonstrate any improvement in perf or timing yet. But I still like to know what would you think about it?

I think this is an area where it's very difficult to foresee on
theoretical grounds what will be right in practice. The problem is
that the best algorithm probably depends on what usage patterns are
common in practice. I think one common usage pattern will be a bunch
of roughly equal-sized messages in a row, like CopyData or DataRow
messages -- but those messages won't have a consistent width. It would
probably be worth testing what behavior you see in such cases -- start
with say a stream of 100 byte messages and then gradually increase and
see how the behavior evolves.

But you can also have other patterns, with messages of different sizes
interleaved. In the case of FE-to-BE traffic, the extended query
protocol might be a good example of that: the Parse message could be
quite long, or not, but the Bind Describe Execute Sync messages that
follow are probably all short. That case doesn't arise in this
direction, but I can't think exactly of which cases do. It seems
like someone would need to play around and try some different cases
and maybe log the sizes of the secure_write() calls with various
algorithms, and then try to figure out what's best. For example, if
the alternating short-write, long-write behavior that Heikki mentioned
is happening, and I do think that particular thing is a very real
risk, then you haven't got it figured out yet...

--
Robert Haas
EDB: http://www.enterprisedb.com

#7Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Robert Haas (#6)
Re: Flushing large data immediately in pqcomm

On Tue, 30 Jan 2024 at 19:48, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jan 30, 2024 at 12:58 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:

Sounds like it's difficult to come up with a heuristic that would work well enough for most cases.
One thing with sending data instead of copying it if the buffer is empty is that initially the buffer is empty. I believe it will stay empty forever if we do not copy anything when the buffer is empty. We can maybe simply set the threshold to the buffer size/2 (4kB) and hope that will work better. Or copy the data only if it fits into the remaining space in the buffer. What do you think?

An additional note while I mentioned pq_putmessage_noblock(), I've been testing sending input data immediately in pq_putmessage_noblock() without blocking and copy the data into PqSendBuffer only if the socket would block and cannot send it. Unfortunately, I don't have strong numbers to demonstrate any improvement in perf or timing yet. But I still like to know what would you think about it?

I think this is an area where it's very difficult to foresee on
theoretical grounds what will be right in practice

I agree that it's hard to prove that such heuristics will always be
better in practice than the status quo. But I feel like we shouldn't
let perfect be the enemy of good here. One approach that is a clear
improvement over the status quo (sketched below) is:
1. If the buffer is empty AND the data we are trying to send is larger
than the buffer size, then don't use the buffer.
2. If not, fill up the buffer first (just like we do now) then send
that. And if the leftover data is then still larger than the buffer,
the buffer is now empty so 1. applies.
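
A minimal sketch of those two rules, assuming they sit in the loop of
internal_putbytes() (untested, for illustration only):

	/* inside internal_putbytes()'s loop over the remaining input */
	if (PqSendStart == PqSendPointer && len >= (size_t) PqSendBufferSize)
	{
		/* rule 1: buffer is empty and the chunk is large -- send directly */
		int		start = 0;
		int		end = (int) len;

		socket_set_nonblocking(false);
		if (internal_flush_buffer(s, &start, &end))
			return EOF;
		len = 0;				/* everything was sent (blocking mode) */
	}
	else
	{
		/*
		 * rule 2: top up the buffer as today; once it has been flushed,
		 * any large remainder falls into the branch above.
		 */
		size_t	amount = PqSendBufferSize - PqSendPointer;

		if (amount > len)
			amount = len;
		memcpy(PqSendBuffer + PqSendPointer, s, amount);
		PqSendPointer += amount;
		s += amount;
		len -= amount;
	}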

#8Robert Haas
robertmhaas@gmail.com
In reply to: Jelte Fennema-Nio (#7)
Re: Flushing large data immediately in pqcomm

On Tue, Jan 30, 2024 at 6:39 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

I agree that it's hard to prove that such heuristics will always be
better in practice than the status quo. But I feel like we shouldn't
let perfect be the enemy of good here.

Sure, I agree.

One approach that is a clear
improvement over the status quo is:
1. If the buffer is empty AND the data we are trying to send is larger
than the buffer size, then don't use the buffer.
2. If not, fill up the buffer first (just like we do now) then send
that. And if the left over data is then still larger than the buffer,
then now the buffer is empty so 1. applies.

That seems like it might be a useful refinement of Melih Mutlu's
original proposal, but consider a message stream that consists of
messages exactly 8kB in size. If that message stream begins when the
buffer is empty, all messages are sent directly. If it begins when
there are any number of bytes in the buffer, we buffer every message
forever. That's kind of an odd artifact, but maybe it's fine in
practice. I say again that it's good to test out a bunch of scenarios
and see what shakes out.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Robert Haas (#8)
Re: Flushing large data immediately in pqcomm

On Wed, 31 Jan 2024 at 18:23, Robert Haas <robertmhaas@gmail.com> wrote:

That's kind of an odd artifact, but maybe it's fine in
practice.

I agree it's an odd artifact, but it's not a regression over the
status quo. Achieving that was the intent of my suggestion: A change
that improves some cases, but regresses nowhere.

I say again that it's good to test out a bunch of scenarios
and see what shakes out.

Testing a bunch of scenarios to find a good one sounds like a good
idea, which can probably give us a more optimal heuristic. But it also
sounds like a lot of work, and probably results in a lot of
discussion. That extra effort might mean that we're not going to
commit any change for PG17 (or even at all). If so, then I'd rather
have a modest improvement from my refinement of Melih's proposal, than
none at all.

#10Robert Haas
robertmhaas@gmail.com
In reply to: Jelte Fennema-Nio (#9)
Re: Flushing large data immediately in pqcomm

On Wed, Jan 31, 2024 at 12:49 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

Testing a bunch of scenarios to find a good one sounds like a good
idea, which can probably give us a more optimal heuristic. But it also
sounds like a lot of work, and probably results in a lot of
discussion. That extra effort might mean that we're not going to
commit any change for PG17 (or even at all). If so, then I'd rather
have a modest improvement from my refinement of Melih's proposal, than
none at all.

Personally, I don't think it's likely that anything will get committed
here without someone doing more legwork than I've seen on the thread
so far. I don't have any plan to pick up this patch anyway, but if I
were thinking about it, I would abandon the idea unless I were
prepared to go test a bunch of stuff myself. I agree with the core
idea of this work, but not with the idea that the bar is as low as "if
it can't lose relative to today, it's good enough."

Of course, another committer may see it differently.

--
Robert Haas
EDB: http://www.enterprisedb.com

#11Melih Mutlu
m.melihmutlu@gmail.com
In reply to: Robert Haas (#8)
Re: Flushing large data immediately in pqcomm

Robert Haas <robertmhaas@gmail.com>, 31 Oca 2024 Çar, 20:23 tarihinde şunu
yazdı:

On Tue, Jan 30, 2024 at 6:39 PM Jelte Fennema-Nio <postgres@jeltef.nl>
wrote:

I agree that it's hard to prove that such heuristics will always be
better in practice than the status quo. But I feel like we shouldn't
let perfect be the enemy of good here.

Sure, I agree.

One approach that is a clear
improvement over the status quo is:
1. If the buffer is empty AND the data we are trying to send is larger
than the buffer size, then don't use the buffer.
2. If not, fill up the buffer first (just like we do now) then send
that. And if the left over data is then still larger than the buffer,
then now the buffer is empty so 1. applies.

That seems like it might be a useful refinement of Melih Mutlu's
original proposal, but consider a message stream that consists of
messages exactly 8kB in size. If that message stream begins when the
buffer is empty, all messages are sent directly. If it begins when
there are any number of bytes in the buffer, we buffer every message
forever. That's kind of an odd artifact, but maybe it's fine in
practice. I say again that it's good to test out a bunch of scenarios
and see what shakes out.

Isn't this already the case? Imagine sending exactly 8kB messages: the
first pq_putmessage() call will buffer 8kB. Any call after this point
simply sends an 8kB message already buffered from the previous call and
buffers a new 8kB message. The only difference here is that we keep the
message in the buffer for a while instead of sending it directly. In
theory, the proposed idea should not make any difference in the number of
flushes and the size of data we send each time, but it can remove
unnecessary copies to the buffer in this case. I guess the behaviour is
also the same with or without the patch in case the buffer already has
some bytes in it.

Robert Haas <robertmhaas@gmail.com>, 31 Oca 2024 Çar, 21:28 tarihinde şunu
yazdı:

Personally, I don't think it's likely that anything will get committed
here without someone doing more legwork than I've seen on the thread
so far. I don't have any plan to pick up this patch anyway, but if I
were thinking about it, I would abandon the idea unless I were
prepared to go test a bunch of stuff myself. I agree with the core
idea of this work, but not with the idea that the bar is as low as "if
it can't lose relative to today, it's good enough."

You're right and I'm open to doing more legwork. I'd also appreciate any
suggestion about how to test this properly and/or useful scenarios to test.
That would be really helpful.

I understand that I should provide more/better analysis around this change
to prove that it doesn't hurt (hopefully) but improves some cases even
though not all the cases. That may even help us to find a better approach
than what's already proposed. Just to clarify, I don't think anyone here
suggests that the bar should be at "if it can't lose relative to today,
it's good enough". IMHO "a change that improves some cases, but regresses
nowhere" does not translate to that.

Thanks,
--
Melih Mutlu
Microsoft

#12Robert Haas
robertmhaas@gmail.com
In reply to: Melih Mutlu (#11)
Re: Flushing large data immediately in pqcomm

On Wed, Jan 31, 2024 at 2:23 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:

That seems like it might be a useful refinement of Melih Mutlu's
original proposal, but consider a message stream that consists of
messages exactly 8kB in size. If that message stream begins when the
buffer is empty, all messages are sent directly. If it begins when
there are any number of bytes in the buffer, we buffer every message
forever. That's kind of an odd artifact, but maybe it's fine in
practice. I say again that it's good to test out a bunch of scenarios
and see what shakes out.

Isn't this already the case? Imagine sending exactly 8kB messages, the first pq_putmessage() call will buffer 8kB. Any call after this point simply sends a 8kB message already buffered from the previous call and buffers a new 8kB message. Only difference here is we keep the message in the buffer for a while instead of sending it directly. In theory, the proposed idea should not bring any difference in the number of flushes and the size of data we send in each time, but can remove unnecessary copies to the buffer in this case. I guess the behaviour is also the same with or without the patch in case the buffer has already some bytes.

Yes, it's never worse than today in terms of number of buffer flushes,
but it doesn't feel like great behavior, either. Users tend not to
like it when the behavior of an algorithm depends heavily on
incidental factors that shouldn't really be relevant, like whether the
buffer starts with 1 byte in it or 0 at the beginning of a long
sequence of messages. They see the performance varying "for no reason"
and they dislike it. They don't say "even the bad performance is no
worse than earlier versions so it's fine."

You're right and I'm open to doing more legwork. I'd also appreciate any suggestion about how to test this properly and/or useful scenarios to test. That would be really helpful.

I think experimenting to see whether the long-short-long-short
behavior that Heikki postulated emerges in practice would be a really
good start.

Another experiment that I think would be interesting is: suppose you
create a patch that sends EVERY message without buffering and compare
that to master. My naive expectation would be that this will lose if
you pump short messages through that connection and win if you pump
long messages through that connection. Is that true? If yes, at what
point do we break even on performance? Does it depend on whether the
connection is local or over a network? Does it depend on whether it's
with or without SSL? Does it depend on Linux vs. Windows vs.
whateverBSD? What happens if you twiddle the 8kB buffer size up or,
say, down to just below the Ethernet frame size?

I think that what we really want to understand here is under what
circumstances the extra layer of buffering is a win vs. being a loss.
If all the stuff I just mentioned doesn't really matter and the answer
is, say, that an 8kB buffer is great and the breakpoint where extra
buffering makes sense is also 8kB, and that's consistent regardless of
other variables, then your algorithm or Jelte's variant or something
of that nature is probably just right. But if it turns out, say, that
the extra buffering is only a win for sub-1kB messages, that would be
rather nice to know before we finalize the approach. Also, if it turns
out that the answer differs dramatically based on whether you're using
a UNIX socket or TCP, that would also be nice to know before
finalizing an algorithm.

I understand that I should provide more/better analysis around this change to prove that it doesn't hurt (hopefully) but improves some cases even though not all the cases. That may even help us to find a better approach than what's already proposed. Just to clarify, I don't think anyone here suggests that the bar should be at "if it can't lose relative to today, it's good enough". IMHO "a change that improves some cases, but regresses nowhere" does not translate to that.

Well, I thought those were fairly similar sentiments, so maybe I'm not
quite understanding the statement in the way it was meant.

--
Robert Haas
EDB: http://www.enterprisedb.com

#13Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#12)
Re: Flushing large data immediately in pqcomm

Hi,

On 2024-01-31 14:57:35 -0500, Robert Haas wrote:

You're right and I'm open to doing more legwork. I'd also appreciate any
suggestion about how to test this properly and/or useful scenarios to
test. That would be really helpful.

I think experimenting to see whether the long-short-long-short
behavior that Heikki postulated emerges in practice would be a really
good start.

Another experiment that I think would be interesting is: suppose you
create a patch that sends EVERY message without buffering and compare
that to master. My naive expectation would be that this will lose if
you pump short messages through that connection and win if you pump
long messages through that connection. Is that true? If yes, at what
point do we break even on performance? Does it depend on whether the
connection is local or over a network? Does it depend on whether it's
with or without SSL? Does it depend on Linux vs. Windows vs.
whateverBSD? What happens if you twiddle the 8kB buffer size up or,
say, down to just below the Ethernet frame size?

I feel like you're putting up too high a bar for something that can be a
pretty clear improvement on its own, without a downside. The current behaviour
is pretty absurd; doing all this research across all platforms isn't going to
disprove that - and it's a lot of work. ISTM we can analyze this without
taking concrete hardware into account easily enough.

One thing that I haven't seen mentioned here that's relevant around using
small buffers: Postgres uses TCP_NODELAY and has to do so. That means doing
tiny sends can hurt substantially.
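
(For reference, the backend enables it on each connection socket with roughly
the usual setsockopt() incantation:

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	int		on = 1;

	(void) setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, (char *) &on, sizeof(on));

so the kernel pushes each send() out immediately instead of waiting to
coalesce it with later data.)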

I think that what we really want to understand here is under what
circumstances the extra layer of buffering is a win vs. being a loss.

It's quite easy to see that doing no buffering isn't viable - we end up with
tiny tiny TCP packets, one for each send(). And then there's the syscall
overhead.

Here's a quickly thrown together benchmark using netperf. First with -D, which
instructs it to use TCP_NODELAY, as we do.

10gbit network, remote host:

$ (fields="request_size,throughput"; echo "$fields";for i in $(seq 0 16); do s=$((2**$i));netperf -P0 -t TCP_STREAM -l1 -H alap5-10gbe -- -r $s,$s -D 1 -o "$fields";done)|column -t -s,

request_size throughput
1 22.73
2 45.77
4 108.64
8 225.78
16 560.32
32 1035.61
64 2177.91
128 3604.71
256 5878.93
512 9334.70
1024 9031.13
2048 9405.35
4096 9334.60
8192 9275.33
16384 9406.29
32768 9385.52
65536 9399.40

localhost:
request_size throughput
1 2.76
2 5.10
4 9.89
8 20.51
16 43.42
32 87.13
64 173.72
128 343.70
256 647.89
512 1328.79
1024 2550.14
2048 4998.06
4096 9482.06
8192 17130.76
16384 29048.02
32768 42106.33
65536 48579.95

I'm slightly baffled by the poor performance of localhost with tiny packet
sizes. Ah, I see - it's the NODELAY; without that:

localhost:
1 32.02
2 60.58
4 114.32
8 262.71
16 558.42
32 1053.66
64 2099.39
128 3815.60
256 6566.19
512 11751.79
1024 18976.11
2048 27222.99
4096 33838.07
8192 38219.60
16384 39146.37
32768 44784.98
65536 44214.70

NODELAY triggers many more context switches, because there's immediately data
available for the receiving side. Whereas with a real network the interrupts get
coalesced.

I think that's pretty clear evidence that we need buffering. But I think we
can probably be smarter than we are right now, and than what's been proposed
in the patch. Because of TCP_NODELAY we shouldn't send a tiny buffer on its
own, it may trigger sending a small TCP packet, which is quite inefficient.

While not perfect - e.g. because networks might use jumbo packets / large MTUs
and we don't know how many outstanding bytes there are locally, I think a
decent heuristic could be to always try to send at least one packet worth of
data at once (something like ~1400 bytes), even if that requires copying some
of the input data. It might not be sent on its own, but it should make it
reasonably unlikely to end up with tiny tiny packets.
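
One way that heuristic could look for the large-message path (the 1400-byte
figure and the helper's shape are assumptions for illustration, not code from
any patch on this thread):

	#define MIN_SEND 1400			/* roughly one ethernet frame of payload */

	static int
	put_large_message(const char *s, size_t len)
	{
		int		pending = PqSendPointer - PqSendStart;
		int		room = PqSendBufferSize - PqSendPointer;
		int		topup = 0;

		/*
		 * Top the buffer up from the head of the payload so the flush below
		 * never emits a tiny TCP_NODELAY packet on its own.
		 */
		if (pending > 0 && pending < MIN_SEND)
		{
			topup = Min((int) len, Min(room, MIN_SEND - pending));
			memcpy(PqSendBuffer + PqSendPointer, s, topup);
			PqSendPointer += topup;
		}

		if (internal_flush())
			return EOF;

		/* send the (still large) remainder directly, bypassing the buffer */
		{
			int		start = topup;
			int		end = (int) len;

			return internal_flush_buffer(s, &start, &end);
		}
	}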

Greetings,

Andres Freund

#14Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#13)
Re: Flushing large data immediately in pqcomm

On Wed, Jan 31, 2024 at 10:24 PM Andres Freund <andres@anarazel.de> wrote:

While not perfect - e.g. because networks might use jumbo packets / large MTUs
and we don't know how many outstanding bytes there are locally, I think a
decent heuristic could be to always try to send at least one packet worth of
data at once (something like ~1400 bytes), even if that requires copying some
of the input data. It might not be sent on its own, but it should make it
reasonably unlikely to end up with tiny tiny packets.

I think that COULD be a decent heuristic but I think it should be
TESTED, including against the ~3 or so other heuristics proposed on
this thread, before we make a decision.

I literally mentioned the Ethernet frame size as one of the things
that we should test whether it's relevant in the exact email to which
you're replying, and you replied by proposing that as a heuristic, but
also criticizing me for wanting more research before we settle on
something. Are we just supposed to assume that your heuristic is
better than the others proposed here without testing anything, or,
like, what? I don't think this needs to be a completely exhaustive or
exhausting process, but I think trying a few different things out and
seeing what happens is smart.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#14)
Re: Flushing large data immediately in pqcomm

On Thu, Feb 1, 2024 at 10:52 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jan 31, 2024 at 10:24 PM Andres Freund <andres@anarazel.de> wrote:

While not perfect - e.g. because networks might use jumbo packets / large MTUs
and we don't know how many outstanding bytes there are locally, I think a
decent heuristic could be to always try to send at least one packet worth of
data at once (something like ~1400 bytes), even if that requires copying some
of the input data. It might not be sent on its own, but it should make it
reasonably unlikely to end up with tiny tiny packets.

I think that COULD be a decent heuristic but I think it should be
TESTED, including against the ~3 or so other heuristics proposed on
this thread, before we make a decision.

I literally mentioned the Ethernet frame size as one of the things
that we should test whether it's relevant in the exact email to which
you're replying, and you replied by proposing that as a heuristic, but
also criticizing me for wanting more research before we settle on
something. Are we just supposed to assume that your heuristic is
better than the others proposed here without testing anything, or,
like, what? I don't think this needs to be a completely exhaustive or
exhausting process, but I think trying a few different things out and
seeing what happens is smart.

There was probably a better way to phrase this email ... the sentiment
is sincere, but there was almost certainly a way of writing it that
didn't sound like I'm super-annoyed.

Apologies for that.

--
Robert Haas
EDB: http://www.enterprisedb.com

#16Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#15)
Re: Flushing large data immediately in pqcomm

Hi,

On 2024-02-01 15:02:57 -0500, Robert Haas wrote:

On Thu, Feb 1, 2024 at 10:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
There was probably a better way to phrase this email ... the sentiment
is sincere, but there was almost certainly a way of writing it that
didn't sound like I'm super-annoyed.

NP - I could have phrased mine better as well...

On Wed, Jan 31, 2024 at 10:24 PM Andres Freund <andres@anarazel.de> wrote:

While not perfect - e.g. because networks might use jumbo packets / large MTUs
and we don't know how many outstanding bytes there are locally, I think a
decent heuristic could be to always try to send at least one packet worth of
data at once (something like ~1400 bytes), even if that requires copying some
of the input data. It might not be sent on its own, but it should make it
reasonably unlikely to end up with tiny tiny packets.

I think that COULD be a decent heuristic but I think it should be
TESTED, including against the ~3 or so other heuristics proposed on
this thread, before we make a decision.

I literally mentioned the Ethernet frame size as one of the things
that we should test whether it's relevant in the exact email to which
you're replying, and you replied by proposing that as a heuristic, but
also criticizing me for wanting more research before we settle on
something.

I mentioned the frame size thing because afaict nobody in the thread had
mentioned our use of TCP_NODELAY (which basically forces the kernel to send
out data immediately instead of waiting for further data to be sent). Without
that it'd be a lot less problematic to occasionally send data in small
increments in between larger sends. Nor would packet sizes be as relevant.

Are we just supposed to assume that your heuristic is better than the
others proposed here without testing anything, or, like, what? I don't
think this needs to be a completely exhaustive or exhausting process, but
I think trying a few different things out and seeing what happens is
smart.

I wasn't trying to say that my heuristic necessarily is better. What I was
trying to get at - and expressed badly - was that I doubt that testing can get
us all that far here. It's not too hard to test the effects of our buffering
with regards to syscall overhead, but once you actually take network effects
into account it gets quite hard. Bandwidth, latency, the specific network
hardware and operating systems involved all play a significant role. Given
how, uh, naive our current approach is, I think analyzing the situation from
first principles and then doing some basic validation of the results makes
more sense.

Separately, I think we shouldn't aim for perfect here. It's obviously
extremely inefficient to send a larger amount of data by memcpy()ing and
send()ing it in 8kB chunks. As mentioned by several folks upthread, we can
improve upon that without having worse behaviour than today. Medium-long term
I suspect we're going to want to use asynchronous network interfaces, in
combination with zero-copy sending, which requires larger changes. Not that
relevant for things like query results, quite relevant for base backups etc.

It's perhaps also worth mentioning that the small send buffer isn't great for
SSL performance, the encryption overhead increases when sending in small
chunks.

I hacked up Melih's patch to send the pending data together with the first bit
of the large "to be sent" data and also added a patch to increased
SINK_BUFFER_LENGTH by 16x. With a 12GB database I tested the time for
pg_basebackup -c fast -Ft --compress=none -Xnone -D - -d "$conn" > /dev/null

                          time via
test                      unix        tcp         tcp+ssl
master                    6.305s      9.436s      15.596s
master-larger-buffer      6.535s      9.453s      15.208s
patch                     5.900s      7.465s      13.634s
patch-larger-buffer       5.233s      5.439s      11.730s

The increase when using tcp is pretty darn impressive. If I had remembered in
time to disable manifests checksums, the win would have been even bigger.

The bottleneck for SSL is that it still ends up with ~16kB sends, not sure
why.

Greetings,

Andres Freund

#17Melih Mutlu
m.melihmutlu@gmail.com
In reply to: Andres Freund (#16)
3 attachment(s)
Re: Flushing large data immediately in pqcomm

Hi hackers,

I did some experiments with this patch, after previous discussions. This
probably does not answer all questions, but would be happy to do more if
needed.

First, I updated the patch according to what was suggested here [1]. PSA v2.
I tweaked the master branch a bit to not allow any buffering. I compared
HEAD, this patch and no buffering at all.
I also added a simple GUC to control PqSendBufferSize, this change only
allows to modify the buffer size and should not have any impact on
performance.

I again ran the COPY TO STDOUT command and timed it. AFAIU COPY sends data
row by row, and I tried running the command under different scenarios with
different # of rows and row sizes. You can find the test script attached
(see test.sh).
All timings are in ms.

1- row size = 100 bytes, # of rows = 1000000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│ │ 1400 bytes │ 2KB │ 4KB │ 8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD │ 1036 │ 998 │ 940 │ 910 │ 894 │ 874 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch │ 1107 │ 1032 │ 980 │ 957 │ 917 │ 909 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 6230 │ 6125 │ 6282 │ 6279 │ 6255 │ 6221 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

2- row size = half of the rows are 1KB and rest is 10KB , # of rows =
1000000
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│ │ 1400 bytes │ 2KB │ 4KB │ 8KB │ 16KB │ 32KB │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD │ 25197 │ 23414 │ 20612 │ 19206 │ 18334 │ 18033 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch │ 19843 │ 19889 │ 19798 │ 19129 │ 18578 │ 18260 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 23752 │ 23565 │ 23602 │ 23622 │ 23541 │ 23599 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

3- row size = half of the rows are 1KB and rest is 1MB , # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│ │ 1400 bytes │ 2KB │ 4KB │ 8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD │ 3137 │ 2937 │ 2687 │ 2551 │ 2456 │ 2465 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch │ 2399 │ 2390 │ 2402 │ 2415 │ 2417 │ 2422 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 2417 │ 2414 │ 2429 │ 2418 │ 2435 │ 2404 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

3- row size = all rows are 1MB , # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│ │ 1400 bytes │ 2KB │ 4KB │ 8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD │ 6113 │ 5764 │ 5281 │ 5009 │ 4885 │ 4872 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch │ 4759 │ 4754 │ 4754 │ 4758 │ 4782 │ 4805 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 4756 │ 4774 │ 4793 │ 4766 │ 4770 │ 4774 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

Some quick observations:
1- Even though I expect both the patch and HEAD behave similarly in case of
small data (case 1: 100 bytes), the patch runs slightly slower than HEAD.
2- In cases where the data does not fit into the buffer, the patch starts
performing better than HEAD. For example, in case 2, patch seems faster
until the buffer size exceeds the data length. When the buffer size is set
to something larger than 10KB (16KB/32KB in this case), there is again a
slight performance loss with the patch as in case 1.
3- With large row sizes (i.e. sizes that do not fit into the buffer) not
buffering at all starts performing better than HEAD. Similarly the patch
performs better too as it stops buffering if data does not fit into the
buffer.

[1]: /messages/by-id/CAGECzQTYUhnC1bO=zNiSpUgCs=hCYxVHvLD2doXNx3My6ZAC2w@mail.gmail.com

Thanks,
--
Melih Mutlu
Microsoft

Attachments:

add-pq_send_buffer_size-GUC.patch (application/x-patch)
From 7f1b72a0948156f8e35ce3b07b5e763a5578d641 Mon Sep 17 00:00:00 2001
From: Melih Mutlu <m.melihmutlu@gmail.com>
Date: Mon, 26 Feb 2024 14:41:35 +0300
Subject: [PATCH] Added pq_send_buffer_size GUC

---
 src/backend/libpq/pqcomm.c          |  2 +-
 src/backend/utils/misc/guc_tables.c | 11 +++++++++++
 src/include/libpq/libpq.h           |  1 +
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index c606bf3447..92708e46e6 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -116,7 +116,7 @@ static List *sock_paths = NIL;
  * enlarged by pq_putmessage_noblock() if the message doesn't fit otherwise.
  */
 
-#define PQ_SEND_BUFFER_SIZE 8192
+int PQ_SEND_BUFFER_SIZE = 8192;
 #define PQ_RECV_BUFFER_SIZE 8192
 
 static char *PqSendBuffer;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 527a2b2734..1f198b19ca 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3595,6 +3595,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"pq_send_buffer_size", PGC_USERSET, DEVELOPER_OPTIONS,
+			gettext_noop("Sets the PqSendBufferSize"),
+			NULL,
+			GUC_NOT_IN_SAMPLE | GUC_UNIT_BYTE
+		},
+		&PQ_SEND_BUFFER_SIZE,
+		8192, 0, MAX_KILOBYTES,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/libpq/libpq.h b/src/include/libpq/libpq.h
index 6171a0d17a..37e91008fb 100644
--- a/src/include/libpq/libpq.h
+++ b/src/include/libpq/libpq.h
@@ -123,6 +123,7 @@ extern PGDLLIMPORT char *SSLECDHCurve;
 extern PGDLLIMPORT bool SSLPreferServerCiphers;
 extern PGDLLIMPORT int ssl_min_protocol_version;
 extern PGDLLIMPORT int ssl_max_protocol_version;
+extern PGDLLIMPORT int PQ_SEND_BUFFER_SIZE;
 
 enum ssl_protocol_versions
 {
-- 
2.34.1

test.sh (text/x-sh)
v2-0001-Flush-large-data-immediately-in-pqcomm.patch (application/octet-stream)
From caba80f3d29eb199e966575a6d05f5a2941355fb Mon Sep 17 00:00:00 2001
From: Melih Mutlu <m.melihmutlu@gmail.com>
Date: Mon, 20 Nov 2023 11:20:52 +0300
Subject: [PATCH v2] Flush large data immediately in pqcomm

If the data is larger than send buffer size in pqcomm, we're sure that
the send buffer will be flushed at least once to fit the data depending
on how large the data is.

Instead of repeatedly calling memcpy and then flushing data larger than
the available space in the send buffer, this patch changes
internal_putbytes() logic to flush large data immediately if the buffer
is empty without unnecessarily copying it into the pqcomm send buffer.
---
 src/backend/libpq/pqcomm.c | 59 +++++++++++++++++++++++++++++---------
 1 file changed, 46 insertions(+), 13 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 6497100a1a..7c54745f69 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -145,6 +145,7 @@ static int	socket_putmessage(char msgtype, const char *s, size_t len);
 static void socket_putmessage_noblock(char msgtype, const char *s, size_t len);
 static int	internal_putbytes(const char *s, size_t len);
 static int	internal_flush(void);
+static int	internal_flush_buffer(const char *s, int *start, int *end);
 
 static int	Lock_AF_UNIX(const char *unixSocketDir, const char *unixSocketPath);
 static int	Setup_AF_UNIX(const char *sock_path);
@@ -1282,14 +1283,32 @@ internal_putbytes(const char *s, size_t len)
 			if (internal_flush())
 				return EOF;
 		}
-		amount = PqSendBufferSize - PqSendPointer;
-		if (amount > len)
-			amount = len;
-		memcpy(PqSendBuffer + PqSendPointer, s, amount);
-		PqSendPointer += amount;
-		s += amount;
-		len -= amount;
+
+		/*
+		 * If the buffer is empty and data length is larger than the buffer
+		 * size, send it without buffering. Otherwise, put as much data as
+		 * possible into the buffer.
+		 */
+		if (!pq_is_send_pending() && len >= PqSendBufferSize)
+		{
+			int start = 0;
+
+			socket_set_nonblocking(false);
+			if (internal_flush_buffer(s, &start, (int *)&len))
+				return EOF;
+		}
+		else
+		{
+			amount = PqSendBufferSize - PqSendPointer;
+			if (amount > len)
+				amount = len;
+			memcpy(PqSendBuffer + PqSendPointer, s, amount);
+			PqSendPointer += amount;
+			s += amount;
+			len -= amount;
+		}
 	}
+
 	return 0;
 }
 
@@ -1323,11 +1342,25 @@ socket_flush(void)
  */
 static int
 internal_flush(void)
+{
+	/* flush the pending output from send buffer. */
+	return internal_flush_buffer(PqSendBuffer, &PqSendStart, &PqSendPointer);
+}
+
+/* --------------------------------
+ *		internal_flush_buffer - flush the given buffer content
+ *
+ * Returns 0 if OK (meaning everything was sent, or operation would block
+ * and the socket is in non-blocking mode), or EOF if trouble.
+ * --------------------------------
+ */
+static int
+internal_flush_buffer(const char *s, int *start, int *end)
 {
 	static int	last_reported_send_errno = 0;
 
-	char	   *bufptr = PqSendBuffer + PqSendStart;
-	char	   *bufend = PqSendBuffer + PqSendPointer;
+	char	   *bufptr = (char*) s + *start;
+	char	   *bufend = (char*) s + *end;
 
 	while (bufptr < bufend)
 	{
@@ -1347,7 +1380,7 @@ internal_flush(void)
 			if (errno == EAGAIN ||
 				errno == EWOULDBLOCK)
 			{
-				return 0;
+								return 0;
 			}
 
 			/*
@@ -1373,7 +1406,7 @@ internal_flush(void)
 			 * flag that'll cause the next CHECK_FOR_INTERRUPTS to terminate
 			 * the connection.
 			 */
-			PqSendStart = PqSendPointer = 0;
+			*start = *end = 0;
 			ClientConnectionLost = 1;
 			InterruptPending = 1;
 			return EOF;
@@ -1381,10 +1414,10 @@ internal_flush(void)
 
 		last_reported_send_errno = 0;	/* reset after any successful send */
 		bufptr += r;
-		PqSendStart += r;
+		*start += r;
 	}
 
-	PqSendStart = PqSendPointer = 0;
+	*start = *end = 0;
 	return 0;
 }
 
-- 
2.34.1

#18Robert Haas
robertmhaas@gmail.com
In reply to: Melih Mutlu (#17)
Re: Flushing large data immediately in pqcomm

On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu <m.melihmutlu@gmail.com> wrote:

1- Even though I expect both the patch and HEAD behave similarly in case of small data (case 1: 100 bytes), the patch runs slightly slower than HEAD.

I wonder why this happens. It seems like maybe something that could be fixed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Melih Mutlu (#17)
Re: Flushing large data immediately in pqcomm

On Thu, 14 Mar 2024 at 12:22, Melih Mutlu <m.melihmutlu@gmail.com> wrote:

I did some experiments with this patch, after previous discussions

One thing I noticed is that the buffer sizes don't seem to matter much
in your experiments, even though Andres's expectation was that 1400
would be better. I think I know the reason for that:

afaict from your test.sh script you connect psql over localhost or
maybe even a unix socket to postgres. Neither of those would have an
MTU of 1500. You'd probably want to do those tests over an actual
network or at least change the MTU of the loopback interface. E.g. my
"lo" interface mtu is 65536 by default:

❯ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever

#20Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Melih Mutlu (#17)
Re: Flushing large data immediately in pqcomm

On 14/03/2024 13:22, Melih Mutlu wrote:

@@ -1282,14 +1283,32 @@ internal_putbytes(const char *s, size_t len)
if (internal_flush())
return EOF;
}
-		amount = PqSendBufferSize - PqSendPointer;
-		if (amount > len)
-			amount = len;
-		memcpy(PqSendBuffer + PqSendPointer, s, amount);
-		PqSendPointer += amount;
-		s += amount;
-		len -= amount;
+
+		/*
+		 * If the buffer is empty and data length is larger than the buffer
+		 * size, send it without buffering. Otherwise, put as much data as
+		 * possible into the buffer.
+		 */
+		if (!pq_is_send_pending() && len >= PqSendBufferSize)
+		{
+			int start = 0;
+
+			socket_set_nonblocking(false);
+			if (internal_flush_buffer(s, &start, (int *)&len))
+				return EOF;
+		}
+		else
+		{
+			amount = PqSendBufferSize - PqSendPointer;
+			if (amount > len)
+				amount = len;
+			memcpy(PqSendBuffer + PqSendPointer, s, amount);
+			PqSendPointer += amount;
+			s += amount;
+			len -= amount;
+		}
}
+
return 0;
}

Two small bugs:

- the "(int *) &len)" cast is not ok, and will break visibly on
big-endian systems where sizeof(int) != sizeof(size_t).

- If internal_flush_buffer() cannot write all the data in one call, it
updates 'start' for how much it wrote, and leaves 'end' unchanged. You
throw the updated 'start' value away, and will send the same data again
on next iteration.

Not a correctness issue, but instead of pq_is_send_pending(), I think it
would be better to check "PqSendStart == PqSendPointer" directly, or
call socket_is_send_pending() directly here. pq_is_send_pending() does
the same, but it's at a higher level of abstraction.

--
Heikki Linnakangas
Neon (https://neon.tech)

#21Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Robert Haas (#18)
Re: Flushing large data immediately in pqcomm

On Thu, 14 Mar 2024 at 13:12, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu <m.melihmutlu@gmail.com> wrote:

1- Even though I expect both the patch and HEAD behave similarly in case of small data (case 1: 100 bytes), the patch runs slightly slower than HEAD.

I wonder why this happens. It seems like maybe something that could be fixed.

some wild guesses:
1. maybe it's the extra call overhead of the new internal_flush
implementation. What happens if you make that an inline function?
2. maybe swap these conditions around (the call seems heavier than a
simple comparison): !pq_is_send_pending() && len >= PqSendBufferSize

BTW, the improvements for the larger rows are awesome!

#22David Rowley
dgrowleyml@gmail.com
In reply to: Jelte Fennema-Nio (#21)
Re: Flushing large data immediately in pqcomm

On Fri, 15 Mar 2024 at 02:03, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

On Thu, 14 Mar 2024 at 13:12, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu <m.melihmutlu@gmail.com> wrote:

1- Even though I expect both the patch and HEAD behave similarly in case of small data (case 1: 100 bytes), the patch runs slightly slower than HEAD.

I wonder why this happens. It seems like maybe something that could be fixed.

some wild guesses:
1. maybe it's the extra call overhead of the new internal_flush
implementation. What happens if you make that an inline function?
2. maybe swap these conditions around (the call seems heavier than a
simple comparison): !pq_is_send_pending() && len >= PqSendBufferSize

I agree these are both worth trying. For #2, I wonder if the
pq_is_send_pending() call is even worth checking at all. It seems to
me that the internal_flush_buffer() code will just do nothing if
nothing is pending. Also, isn't there almost always going to be
something pending when the "len >= PqSendBufferSize" condition is met?
We've just added the msgtype and number of bytes to the buffer which
is 5 bytes. If the previous message was also more than
PqSendBufferSize, then the buffer is likely to have 5 bytes due to the
previous flush, otherwise isn't it a 1 in 8192 chance that the buffer
is empty?

If that fails to resolve the regression, maybe it's worth memcpy()ing
enough bytes out of the message to fill the buffer then flush it and
check if we still have > PqSendBufferSize remaining and skip the
memcpy() for the rest. That way there are no small flushes of just 5
bytes and only ever the possibility of reducing the flushes as no
pattern should cause the number of flushes to increase.
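
To sketch that idea in code (illustrative only, not from any posted
patch; it reuses the variable names from internal_putbytes()):

	/* top up whatever room is left in the buffer first */
	amount = PqSendBufferSize - PqSendPointer;
	if (amount > len)
		amount = len;
	memcpy(PqSendBuffer + PqSendPointer, s, amount);
	PqSendPointer += amount;
	s += amount;
	len -= amount;

	/* if we filled it, flush; this is never a tiny 5-byte flush */
	if (PqSendPointer >= PqSendBufferSize && internal_flush())
		return EOF;

	/* remainder still larger than the buffer: send it without copying */
	if (len >= PqSendBufferSize)
	{
		/* ... call internal_flush_buffer(s, ...) as in the patch ... */
	}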

David

#23David Rowley
dgrowleyml@gmail.com
In reply to: Heikki Linnakangas (#20)
Re: Flushing large data immediately in pqcomm

On Fri, 15 Mar 2024 at 01:46, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

- the "(int *) &len)" cast is not ok, and will break visibly on
big-endian systems where sizeof(int) != sizeof(size_t).

I think fixing this requires adjusting the signature of
internal_flush_buffer() to use size_t instead of int. That also
means that PqSendStart and PqSendPointer must also become size_t, or
internal_flush() must add local size_t variables to pass to
internal_flush_buffer and assign these back again to the global after
the call. Upgrading the globals might be the cleaner option.
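
Roughly, the shape would be something like this (just a sketch of the
signature changes, not the exact patch):

	static size_t PqSendPointer;	/* Next index to store a byte in PqSendBuffer */
	static size_t PqSendStart;		/* Next index to send a byte in PqSendBuffer */

	static int	internal_flush_buffer(const char *s, size_t *start, size_t *end);

	/* ... and in internal_putbytes() the cast goes away: */
	size_t		start = 0;

	if (internal_flush_buffer(s, &start, &len))
		return EOF;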

David

#24Melih Mutlu
m.melihmutlu@gmail.com
In reply to: David Rowley (#22)
Re: Flushing large data immediately in pqcomm

David Rowley <dgrowleyml@gmail.com>, 21 Mar 2024 Per, 00:54 tarihinde şunu
yazdı:

On Fri, 15 Mar 2024 at 02:03, Jelte Fennema-Nio <postgres@jeltef.nl>
wrote:

On Thu, 14 Mar 2024 at 13:12, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu <m.melihmutlu@gmail.com>

wrote:

1- Even though I expect both the patch and HEAD behave similarly in

case of small data (case 1: 100 bytes), the patch runs slightly slower than
HEAD.

I wonder why this happens. It seems like maybe something that could be

fixed.

some wild guesses:
1. maybe it's the extra call overhead of the new internal_flush
implementation. What happens if you make that an inline function?
2. maybe swap these conditions around (the call seems heavier than a
simple comparison): !pq_is_send_pending() && len >= PqSendBufferSize

I agree these are both worth trying. For #2, I wonder if the
pq_is_send_pending() call is even worth checking at all. It seems to
me that the internal_flush_buffer() code will just do nothing if
nothing is pending. Also, isn't there almost always going to be
something pending when the "len >= PqSendBufferSize" condition is met?
We've just added the msgtype and number of bytes to the buffer which
is 5 bytes. If the previous message was also more than
PqSendBufferSize, then the buffer is likely to have 5 bytes due to the
previous flush, otherwise isn't it a 1 in 8192 chance that the buffer
is empty?

If that fails to resolve the regression, maybe it's worth memcpy()ing
enough bytes out of the message to fill the buffer then flush it and
check if we still have > PqSendBufferSize remaining and skip the
memcpy() for the rest. That way there are no small flushes of just 5
bytes and only ever the possibility of reducing the flushes as no
pattern should cause the number of flushes to increase.

In len > PqSendBufferSize cases, the buffer should be filled as much as
possible if we're sure that it will be flushed at some point. Otherwise we
might end up with small flushes. The case where we're sure that the buffer
will be flushed is when the buffer is not empty. If it's empty, there is no
need to fill it unnecessarily, as that might cause an additional flush. AFAIU
from what you said, we shouldn't be worried about such a case since the
buffer is unlikely to be empty thanks to the first 5 bytes. I guess the only
case where the buffer can be empty is when it holds PqSendBufferSize-5 bytes
from previous messages and adding the 5 bytes of the current message flushes
it. I'm not sure if removing the check could cause a regression in any case,
but it's just there to be safe.

What if I do a simple comparison like PqSendStart == PqSendPointer instead
of calling pq_is_send_pending(), as Heikki suggested? Then this check should
not hurt that much. Right? Does that make sense?

--
Melih Mutlu
Microsoft

#25David Rowley
dgrowleyml@gmail.com
In reply to: Melih Mutlu (#24)
Re: Flushing large data immediately in pqcomm

On Thu, 21 Mar 2024 at 13:24, Melih Mutlu <m.melihmutlu@gmail.com> wrote:

What if I do a simple comparison like PqSendStart == PqSendPointer instead of calling pq_is_send_pending() as Heikki suggested, then this check should not hurt that much. Right? Does that make sense?

As I understand the code, there's no problem calling
internal_flush_buffer() when the buffer is empty and I suspect that if
we're sending a few buffers with "len > PqSendBufferSize" that it's
just so unlikely that the buffer is empty that we should just do the
function call and let internal_flush_buffer() handle doing nothing if
the buffer really is empty. I think the chances of
internal_flush_buffer() having to do exactly nothing here is less than
1 in 8192, so I just don't think the check is worthwhile. The reason
I don't think the odds are exactly 1 in 8192 is because if we're
sending a large number of bytes then it will be common that the buffer
will contain exactly 5 bytes due to the previous flush and command
prefix just having been added.

It's worth testing both, however. I might be wrong. Performance is
hard to predict. It would be good to see your test.sh script run with
and without the PqSendStart == PqSendPointer condition.

David

#26Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: David Rowley (#25)
Re: Flushing large data immediately in pqcomm

On Thu, 21 Mar 2024 at 01:45, David Rowley <dgrowleyml@gmail.com> wrote:

As I understand the code, there's no problem calling
internal_flush_buffer() when the buffer is empty and I suspect that if
we're sending a few buffers with "len > PqSendBufferSize" that it's
just so unlikely that the buffer is empty that we should just do the
function call and let internal_flush_buffer() handle doing nothing if
the buffer really is empty. I think the chances of
internal_flush_buffer() having to do exactly nothing here is less than
1 in 8192, so I just don't think the check is worthwhile.

I think you're missing the exact case that we're trying to improve
here: Calls to internal_putbytes with a very large len, e.g. 1MB.
With the new code the buffer will be empty ~50% of the time (not less
than 1 in 8192) with such large buffers, because the flow that will
happen:

1. We check len > PqSendBufferSize. There are some bytes in the buffer
e.g. the 5 bytes of the msgtype. So we fill up the buffer, but have
many bytes left in len.
2. We loop again, because len is not 0.
3. We flush the buffer (at the top of the loop) because the buffer is full.
4. We check len > PqSendBufferSize. Now the buffer is empty, so we
call internal_flush_buffer directly

As you can see we check len > PqSendBufferSize twice (in step 1 and
step 4), and 1 out of 2 of those times the buffer is empty.
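
As a concrete (made-up) example: with the default 8kB buffer and a 1MB
datum, step 1 copies ~8187 bytes in after the 5-byte header, step 3
flushes that full buffer, and step 4 then sees an empty buffer and hands
the remaining ~1MB straight to internal_flush_buffer. So exactly one of
the two checks ran with an empty buffer.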

To be clear, the code is done this way so our behaviour would only
ever be better than the status-quo, and cause no regressions. For
instance, flushing the 5 byte header separately and then flushing the
full input buffer might result in more IP packets being sent in total
in some cases due to our TCP_NODELAY.

#27Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Melih Mutlu (#24)
Re: Flushing large data immediately in pqcomm

On Thu, 21 Mar 2024 at 01:24, Melih Mutlu <m.melihmutlu@gmail.com> wrote:

What if I do a simple comparison like PqSendStart == PqSendPointer instead of calling pq_is_send_pending()

Yeah, that sounds worth trying out. So the new suggestions to fix the
perf issues on small message sizes would be:

1. add "inline" to internal_flush function
2. replace pq_is_send_pending() with PqSendStart == PqSendPointer
3. (optional) swap the order of PqSendStart == PqSendPointer and len
>= PqSendBufferSize

#28David Rowley
dgrowleyml@gmail.com
In reply to: Jelte Fennema-Nio (#26)
Re: Flushing large data immediately in pqcomm

On Thu, 21 Mar 2024 at 22:44, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

On Thu, 21 Mar 2024 at 01:45, David Rowley <dgrowleyml@gmail.com> wrote:

As I understand the code, there's no problem calling
internal_flush_buffer() when the buffer is empty and I suspect that if
we're sending a few buffers with "len > PqSendBufferSize" that it's
just so unlikely that the buffer is empty that we should just do the
function call and let internal_flush_buffer() handle doing nothing if
the buffer really is empty. I think the chances of
internal_flush_buffer() having to do exactly nothing here is less than
1 in 8192, so I just don't think the check is worthwhile.

I think you're missing the exact case that we're trying to improve
here: Calls to internal_putbytes with a very large len, e.g. 1MB.
With the new code the buffer will be empty ~50% of the time (not less
than 1 in 8192) with such large buffers, because the flow that will
happen:

It was the code I misread. I understand what the aim is. I failed to
notice the while loop in internal_putbytes(). So what I mentioned
about trying to fill the buffer before flushing already happens. I
now agree with the PqSendStart == PqSendPointer test. I'd say since
the reported regression was with 100 byte rows that testing "len >=
PqSendBufferSize" before PqSendStart == PqSendPointer makes sense.

David

#29Melih Mutlu
m.melihmutlu@gmail.com
In reply to: Heikki Linnakangas (#20)
Re: Flushing large data immediately in pqcomm

Heikki Linnakangas <hlinnaka@iki.fi>, 14 Mar 2024 Per, 15:46 tarihinde şunu
yazdı:

On 14/03/2024 13:22, Melih Mutlu wrote:

@@ -1282,14 +1283,32 @@ internal_putbytes(const char *s, size_t len)
if (internal_flush())
return EOF;
}
-             amount = PqSendBufferSize - PqSendPointer;
-             if (amount > len)
-                     amount = len;
-             memcpy(PqSendBuffer + PqSendPointer, s, amount);
-             PqSendPointer += amount;
-             s += amount;
-             len -= amount;
+
+             /*
+              * If the buffer is empty and data length is larger than the buffer
+              * size, send it without buffering. Otherwise, put as much data as
+              * possible into the buffer.
+              */
+             if (!pq_is_send_pending() && len >= PqSendBufferSize)
+             {
+                     int start = 0;
+
+                     socket_set_nonblocking(false);
+                     if (internal_flush_buffer(s, &start, (int *)&len))
+                             return EOF;
+             }
+             else
+             {
+                     amount = PqSendBufferSize - PqSendPointer;
+                     if (amount > len)
+                             amount = len;
+                     memcpy(PqSendBuffer + PqSendPointer, s, amount);
+                     PqSendPointer += amount;
+                     s += amount;
+                     len -= amount;
+             }
}
+
return 0;
}

Two small bugs:

- the "(int *) &len)" cast is not ok, and will break visibly on
big-endian systems where sizeof(int) != sizeof(size_t).

- If internal_flush_buffer() cannot write all the data in one call, it
updates 'start' for how much it wrote, and leaves 'end' unchanged. You
throw the updated 'start' value away, and will send the same data again
on next iteration.

There are two possible options for internal_flush_buffer() in the
internal_putbytes() case:
1- Write all the data and return 0. We don't need start or end of the data
in this case.
2- Cannot write all and return EOF. In this case internal_putbytes() also
returns EOF immediately and does not really retry. There will be no next
iteration.

If it were non-blocking, then we might need to keep the new value. But I
think we do not need the updated start value in either case here. What do
you think?

Thanks,
--
Melih Mutlu
Microsoft

#30Melih Mutlu
m.melihmutlu@gmail.com
In reply to: Jelte Fennema-Nio (#27)
1 attachment(s)
Re: Flushing large data immediately in pqcomm

Hi,

PSA v3.

Jelte Fennema-Nio <postgres@jeltef.nl>, 21 Mar 2024 Per, 12:58 tarihinde
şunu yazdı:

On Thu, 21 Mar 2024 at 01:24, Melih Mutlu <m.melihmutlu@gmail.com> wrote:

What if I do a simple comparison like PqSendStart == PqSendPointer

instead of calling pq_is_send_pending()

Yeah, that sounds worth trying out. So the new suggestions to fix the
perf issues on small message sizes would be:

1. add "inline" to internal_flush function
2. replace pq_is_send_pending() with PqSendStart == PqSendPointer
3. (optional) swap the order of PqSendStart == PqSendPointer and len
>= PqSendBufferSize

I did all of the above changes and it seems like those resolved the
regression issue.
Since the previous results were with unix sockets, I share here the results
of v3 when using unix sockets for comparison.
Sharing only the case where all messages are 100 bytes, since this was when
the regression was most visible.

row size = 100 bytes, # of rows = 1000000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│ │ 1400 bytes │ 2KB │ 4KB │ 8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD │ 1106 │ 1006 │ 947 │ 920 │ 899 │ 888 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch │ 1094 │ 997 │ 943 │ 913 │ 894 │ 881 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 6389 │ 6195 │ 6214 │ 6271 │ 6325 │ 6211 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

David Rowley <dgrowleyml@gmail.com>, 21 Mar 2024 Per, 00:57 tarihinde şunu
yazdı:

On Fri, 15 Mar 2024 at 01:46, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

- the "(int *) &len)" cast is not ok, and will break visibly on
big-endian systems where sizeof(int) != sizeof(size_t).

I think fixing this requires adjusting the signature of
internal_flush_buffer() to use size_t instead of int. That also
means that PqSendStart and PqSendPointer must also become size_t, or
internal_flush() must add local size_t variables to pass to
internal_flush_buffer and assign these back again to the global after
the call. Upgrading the globals might be the cleaner option.

David

This is done too.

I actually tried to test it over a real network for a while. However, I
couldn't get reliable enough numbers with both HEAD and the patch due to
network-related issues.
I've decided to go with Jelte's suggestion [1], which is decreasing the MTU
of the loopback interface to 1500 and using localhost.

Here are the results:

1- row size = 100 bytes, # of rows = 1000000
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│ │ 1400 bytes │ 2KB │ 4KB │ 8KB │ 16KB │ 32KB │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD │ 1351 │ 1233 │ 1074 │ 988 │ 944 │ 916 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch │ 1369 │ 1232 │ 1073 │ 981 │ 928 │ 907 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 14949 │ 14533 │ 14791 │ 14864 │ 14612 │ 14751 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

2- row size = half of the rows are 1KB and rest is 10KB , # of rows =
1000000
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│ │ 1400 bytes │ 2KB │ 4KB │ 8KB │ 16KB │ 32KB │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD │ 37212 │ 31372 │ 25520 │ 21980 │ 20311 │ 18864 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch │ 23006 │ 23127 │ 23147 │ 22229 │ 20367 │ 19155 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 30725 │ 31090 │ 30917 │ 30796 │ 30984 │ 30813 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

3- row size = half of the rows are 1KB and rest is 1MB , # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│ │ 1400 bytes │ 2KB │ 4KB │ 8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD │ 4296 │ 3713 │ 3040 │ 2711 │ 2528 │ 2449 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch │ 2401 │ 2411 │ 2404 │ 2374 │ 2395 │ 2408 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 2399 │ 2403 │ 2408 │ 2389 │ 2402 │ 2403 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

4- row size = all rows are 1MB , # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│ │ 1400 bytes │ 2KB │ 4KB │ 8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD │ 8335 │ 7370 │ 6017 │ 5368 │ 5009 │ 4843 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch │ 4711 │ 4722 │ 4708 │ 4693 │ 4724 │ 4717 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 4704 │ 4712 │ 4746 │ 4728 │ 4709 │ 4730 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

[1]: /messages/by-id/CAGECzQQMktuTj8ijJgBRXCwLEqfJyAFxg1h7rCTej-6=cR0r=Q@mail.gmail.com

Thanks,
--
Melih Mutlu
Microsoft

Attachments:

v3-0001-Flush-large-data-immediately-in-pqcomm.patch
From ae73dd46b099cf7c21e4f1c725d12bfd4f3e45b1 Mon Sep 17 00:00:00 2001
From: Melih Mutlu <m.melihmutlu@gmail.com>
Date: Mon, 20 Nov 2023 11:20:52 +0300
Subject: [PATCH v3] Flush large data immediately in pqcomm

If the data is larger than send buffer size in pqcomm, we're sure that
the send buffer will be flushed at least once to fit the data depending
on how large the data is.

Instead of repeatedly calling memcpy and then flushing data larger than
the available space in the send buffer, this patch changes
internal_putbytes() logic to flush large data immediately if the buffer
is empty without unnecessarily copying it into the pqcomm send buffer.
---
 src/backend/libpq/pqcomm.c | 61 +++++++++++++++++++++++++++++---------
 1 file changed, 47 insertions(+), 14 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 6497100a1a..f663a0cfdd 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -120,8 +120,8 @@ static List *sock_paths = NIL;
 
 static char *PqSendBuffer;
 static int	PqSendBufferSize;	/* Size send buffer */
-static int	PqSendPointer;		/* Next index to store a byte in PqSendBuffer */
-static int	PqSendStart;		/* Next index to send a byte in PqSendBuffer */
+static size_t PqSendPointer;		/* Next index to store a byte in PqSendBuffer */
+static size_t PqSendStart;		/* Next index to send a byte in PqSendBuffer */
 
 static char PqRecvBuffer[PQ_RECV_BUFFER_SIZE];
 static int	PqRecvPointer;		/* Next index to read a byte from PqRecvBuffer */
@@ -145,6 +145,7 @@ static int	socket_putmessage(char msgtype, const char *s, size_t len);
 static void socket_putmessage_noblock(char msgtype, const char *s, size_t len);
 static int	internal_putbytes(const char *s, size_t len);
 static int	internal_flush(void);
+static int	internal_flush_buffer(const char *s, size_t *start, size_t *end);
 
 static int	Lock_AF_UNIX(const char *unixSocketDir, const char *unixSocketPath);
 static int	Setup_AF_UNIX(const char *sock_path);
@@ -1282,14 +1283,32 @@ internal_putbytes(const char *s, size_t len)
 			if (internal_flush())
 				return EOF;
 		}
-		amount = PqSendBufferSize - PqSendPointer;
-		if (amount > len)
-			amount = len;
-		memcpy(PqSendBuffer + PqSendPointer, s, amount);
-		PqSendPointer += amount;
-		s += amount;
-		len -= amount;
+
+		/*
+		 * If the buffer is empty and data length is larger than the buffer
+		 * size, send it without buffering. Otherwise, put as much data as
+		 * possible into the buffer.
+		 */
+		if (len >= PqSendBufferSize && PqSendStart == PqSendPointer)
+		{
+			size_t start = 0;
+
+			socket_set_nonblocking(false);
+			if (internal_flush_buffer(s, &start, &len))
+				return EOF;
+		}
+		else
+		{
+			amount = PqSendBufferSize - PqSendPointer;
+			if (amount > len)
+				amount = len;
+			memcpy(PqSendBuffer + PqSendPointer, s, amount);
+			PqSendPointer += amount;
+			s += amount;
+			len -= amount;
+		}
 	}
+
 	return 0;
 }
 
@@ -1323,11 +1342,25 @@ socket_flush(void)
  */
 static int
 internal_flush(void)
+{
+	/* flush the pending output from send buffer. */
+	return internal_flush_buffer(PqSendBuffer, &PqSendStart, &PqSendPointer);
+}
+
+/* --------------------------------
+ *		internal_flush_buffer - flush the given buffer content
+ *
+ * Returns 0 if OK (meaning everything was sent, or operation would block
+ * and the socket is in non-blocking mode), or EOF if trouble.
+ * --------------------------------
+ */
+static int
+internal_flush_buffer(const char *s, size_t *start, size_t *end)
 {
 	static int	last_reported_send_errno = 0;
 
-	char	   *bufptr = PqSendBuffer + PqSendStart;
-	char	   *bufend = PqSendBuffer + PqSendPointer;
+	char	   *bufptr = (char*) s + *start;
+	char	   *bufend = (char*) s + *end;
 
 	while (bufptr < bufend)
 	{
@@ -1373,7 +1406,7 @@ internal_flush(void)
 			 * flag that'll cause the next CHECK_FOR_INTERRUPTS to terminate
 			 * the connection.
 			 */
-			PqSendStart = PqSendPointer = 0;
+			*start = *end = 0;
 			ClientConnectionLost = 1;
 			InterruptPending = 1;
 			return EOF;
@@ -1381,10 +1414,10 @@ internal_flush(void)
 
 		last_reported_send_errno = 0;	/* reset after any successful send */
 		bufptr += r;
-		PqSendStart += r;
+		*start += r;
 	}
 
-	PqSendStart = PqSendPointer = 0;
+	*start = *end = 0;
 	return 0;
 }
 
-- 
2.34.1

#31David Rowley
dgrowleyml@gmail.com
In reply to: Melih Mutlu (#30)
Re: Flushing large data immediately in pqcomm

On Fri, 22 Mar 2024 at 12:46, Melih Mutlu <m.melihmutlu@gmail.com> wrote:

I did all of the above changes and it seems like those resolved the regression issue.

Thanks for adjusting the patch. The numbers do look better, but on
looking at your test.sh script from [1], I see:

meson setup --buildtype debug -Dcassert=true
--prefix="$DESTDIR/usr/local/pgsql" $DESTDIR && \

can you confirm if the test was done in debug with casserts on? If
so, it would be much better to have asserts off and have
-Dbuildtype=release.

I'm planning to run some benchmarks tomorrow. My thoughts are that
the patch allows the memcpy() to be skipped without adding any
additional buffer flushes and demonstrates a good performance increase
in various scenarios from doing so. I think that is a satisfactory
goal. If I don't see any issues from reviewing and benchmarking
tomorrow, I'd like to commit this.

Robert, I understand you'd like a bit more from this patch. I'm
wondering if you're planning on blocking another committer from going
ahead with this? Or if you have a reason why the current state of the
patch is not a meaningful enough improvement that would justify
possibly not getting any improvements in this area for PG17?

David

[1]: /messages/by-id/CAGPVpCSX8bTF61ZL9jOgh1AaY3bgsWnQ6J7WmJK4TV0f2LPnJQ@mail.gmail.com

#32Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#31)
Re: Flushing large data immediately in pqcomm

On Wed, Mar 27, 2024 at 7:39 AM David Rowley <dgrowleyml@gmail.com> wrote:

Robert, I understand you'd like a bit more from this patch. I'm
wondering if you planning on blocking another committer from going
ahead with this? Or if you have a reason why the current state of the
patch is not a meaningful enough improvement that would justify
possibly not getting any improvements in this area for PG17?

So, I think that the first version of the patch, when it got a big
chunk of data, would just flush whatever was already in the buffer and
then send the rest without copying. The current version, as I
understand it, only does that if the buffer is empty; otherwise, it
copies as much data as it can into the partially-filled buffer. I
think that change addresses most of my concern about the approach; the
old way could, I believe, lead to an increased total number of flushes
with the right usage pattern, but I don't believe that's possible with
the revised approach. I do kind of wonder whether there is some more
fine-tuning of the approach that would improve things further, but I
realize that we have very limited time to figure this out, and there's
no sense letting the perfect be the enemy of the good.

So in short... no, I don't have big concerns at this point. Melih's
latest benchmarks look fairly promising to me, too.

--
Robert Haas
EDB: http://www.enterprisedb.com

#33Melih Mutlu
m.melihmutlu@gmail.com
In reply to: David Rowley (#31)
Re: Flushing large data immediately in pqcomm

On Wed, Mar 27, 2024 at 14:39 David Rowley <dgrowleyml@gmail.com> wrote:

On Fri, 22 Mar 2024 at 12:46, Melih Mutlu <m.melihmutlu@gmail.com> wrote:

I did all of the above changes and it seems like those resolved the

regression issue.

Thanks for adjusting the patch. The numbers do look better, but on
looking at your test.sh script from [1], I see:

meson setup --buildtype debug -Dcassert=true
--prefix="$DESTDIR/usr/local/pgsql" $DESTDIR && \

can you confirm if the test was done in debug with casserts on? If
so, it would be much better to have asserts off and have
-Dbuildtype=release.

Yes, previous numbers were with --buildtype debug -Dcassert=true. I can
share new numbers with release build and asserts off soon.

Thanks,
Melih

#34Melih Mutlu
m.melihmutlu@gmail.com
In reply to: Robert Haas (#32)
Re: Flushing large data immediately in pqcomm

On Wed, Mar 27, 2024 at 18:54 Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 27, 2024 at 7:39 AM David Rowley <dgrowleyml@gmail.com> wrote:

Robert, I understand you'd like a bit more from this patch. I'm
wondering if you planning on blocking another committer from going
ahead with this? Or if you have a reason why the current state of the
patch is not a meaningful enough improvement that would justify
possibly not getting any improvements in this area for PG17?

So, I think that the first version of the patch, when it got a big
chunk of data, would just flush whatever was already in the buffer and
then send the rest without copying.

Correct.

The current version, as I

understand it, only does that if the buffer is empty; otherwise, it
copies data as much data as it can into the partially-filled buffer.

Yes, currently it should fill and flush the buffer first, if it’s not
already empty. Only then it sends the rest without copying.

Thanks,
Melih

#35Melih Mutlu
m.melihmutlu@gmail.com
In reply to: Melih Mutlu (#33)
1 attachment(s)
Re: Flushing large data immediately in pqcomm

Hi,

Melih Mutlu <m.melihmutlu@gmail.com>, 28 Mar 2024 Per, 22:44 tarihinde şunu
yazdı:

On Wed, Mar 27, 2024 at 14:39 David Rowley <dgrowleyml@gmail.com> wrote:

On Fri, 22 Mar 2024 at 12:46, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
can you confirm if the test was done in debug with casserts on? If
so, it would be much better to have asserts off and have
-Dbuildtype=release.

Yes, previous numbers were with --buildtype debug -Dcassert=true. I can

share new numbers with release build and asserts off soon.

While testing the patch without --buildtype debug -Dcassert=true, I felt
like there was still a slight regression. I changed internal_flush() to an
inline function; results look better this way.

1- row size = 100 bytes, # of rows = 1000000
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│ │ 1400 bytes │ 2KB │ 4KB │ 8KB │ 16KB │ 32KB │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD │ 861 │ 765 │ 612 │ 521 │ 477 │ 480 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch │ 869 │ 766 │ 612 │ 519 │ 482 │ 467 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 13978 │ 13746 │ 13909 │ 13956 │ 13920 │ 13895 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

2- row size = half of the rows are 1KB and rest is 10KB , # of rows =
1000000
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│ │ 1400 bytes │ 2KB │ 4KB │ 8KB │ 16KB │ 32KB │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD │ 30195 │ 26455 │ 17338 │ 14562 │ 12844 │ 11652 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch │ 14744 │ 15830 │ 15697 │ 14273 │ 12794 │ 11652 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 24054 │ 23992 │ 24162 │ 23951 │ 23901 │ 23925 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

3- row size = half of the rows are 1KB and rest is 1MB , # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│ │ 1400 bytes │ 2KB │ 4KB │ 8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD │ 3546 │ 3029 │ 2373 │ 2032 │ 1873 │ 1806 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch │ 1715 │ 1723 │ 1724 │ 1731 │ 1729 │ 1709 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 1749 │ 1748 │ 1742 │ 1744 │ 1757 │ 1744 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

4- row size = all rows are 1MB , # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│ │ 1400 bytes │ 2KB │ 4KB │ 8KB │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD │ 7089 │ 5987 │ 4697 │ 4048 │ 3737 │ 3523 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch │ 3438 │ 3411 │ 3400 │ 3416 │ 3399 │ 3429 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 3432 │ 3432 │ 3416 │ 3424 │ 3378 │ 3429 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

Thanks,
--
Melih Mutlu
Microsoft

Attachments:

v4-0001-Flush-large-data-immediately-in-pqcomm.patch
From ef94fceb7be1217f8f7c0d667a8b0945d5421535 Mon Sep 17 00:00:00 2001
From: Melih Mutlu <m.melihmutlu@gmail.com>
Date: Mon, 20 Nov 2023 11:20:52 +0300
Subject: [PATCH v4] Flush large data immediately in pqcomm

If the data is larger than send buffer size in pqcomm, we're sure that
the send buffer will be flushed at least once to fit the data depending
on how large the data is.

Instead of repeatedly calling memcpy and then flushing data larger than
the available space in the send buffer, this patch changes
internal_putbytes() logic to flush large data immediately if the buffer
is empty without unnecessarily copying it into the pqcomm send buffer.
---
 src/backend/libpq/pqcomm.c | 65 ++++++++++++++++++++++++++++----------
 1 file changed, 49 insertions(+), 16 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 6497100a1a..1bc31f3cb4 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -120,8 +120,8 @@ static List *sock_paths = NIL;
 
 static char *PqSendBuffer;
 static int	PqSendBufferSize;	/* Size send buffer */
-static int	PqSendPointer;		/* Next index to store a byte in PqSendBuffer */
-static int	PqSendStart;		/* Next index to send a byte in PqSendBuffer */
+static size_t PqSendPointer;		/* Next index to store a byte in PqSendBuffer */
+static size_t PqSendStart;		/* Next index to send a byte in PqSendBuffer */
 
 static char PqRecvBuffer[PQ_RECV_BUFFER_SIZE];
 static int	PqRecvPointer;		/* Next index to read a byte from PqRecvBuffer */
@@ -144,7 +144,8 @@ static bool socket_is_send_pending(void);
 static int	socket_putmessage(char msgtype, const char *s, size_t len);
 static void socket_putmessage_noblock(char msgtype, const char *s, size_t len);
 static int	internal_putbytes(const char *s, size_t len);
-static int	internal_flush(void);
+static inline int internal_flush(void);
+static int	internal_flush_buffer(const char *s, size_t *start, size_t *end);
 
 static int	Lock_AF_UNIX(const char *unixSocketDir, const char *unixSocketPath);
 static int	Setup_AF_UNIX(const char *sock_path);
@@ -1282,14 +1283,32 @@ internal_putbytes(const char *s, size_t len)
 			if (internal_flush())
 				return EOF;
 		}
-		amount = PqSendBufferSize - PqSendPointer;
-		if (amount > len)
-			amount = len;
-		memcpy(PqSendBuffer + PqSendPointer, s, amount);
-		PqSendPointer += amount;
-		s += amount;
-		len -= amount;
+
+		/*
+		 * If the buffer is empty and data length is larger than the buffer
+		 * size, send it without buffering. Otherwise, put as much data as
+		 * possible into the buffer.
+		 */
+		if (len >= PqSendBufferSize && PqSendStart == PqSendPointer)
+		{
+			size_t start = 0;
+
+			socket_set_nonblocking(false);
+			if (internal_flush_buffer(s, &start, &len))
+				return EOF;
+		}
+		else
+		{
+			amount = PqSendBufferSize - PqSendPointer;
+			if (amount > len)
+				amount = len;
+			memcpy(PqSendBuffer + PqSendPointer, s, amount);
+			PqSendPointer += amount;
+			s += amount;
+			len -= amount;
+		}
 	}
+
 	return 0;
 }
 
@@ -1321,13 +1340,27 @@ socket_flush(void)
  * and the socket is in non-blocking mode), or EOF if trouble.
  * --------------------------------
  */
-static int
+static inline int
 internal_flush(void)
+{
+	/* flush the pending output from send buffer. */
+	return internal_flush_buffer(PqSendBuffer, &PqSendStart, &PqSendPointer);
+}
+
+/* --------------------------------
+ *		internal_flush_buffer - flush the given buffer content
+ *
+ * Returns 0 if OK (meaning everything was sent, or operation would block
+ * and the socket is in non-blocking mode), or EOF if trouble.
+ * --------------------------------
+ */
+static inline int
+internal_flush_buffer(const char *s, size_t *start, size_t *end)
 {
 	static int	last_reported_send_errno = 0;
 
-	char	   *bufptr = PqSendBuffer + PqSendStart;
-	char	   *bufend = PqSendBuffer + PqSendPointer;
+	char	   *bufptr = (char*) s + *start;
+	char	   *bufend = (char*) s + *end;
 
 	while (bufptr < bufend)
 	{
@@ -1373,7 +1406,7 @@ internal_flush(void)
 			 * flag that'll cause the next CHECK_FOR_INTERRUPTS to terminate
 			 * the connection.
 			 */
-			PqSendStart = PqSendPointer = 0;
+			*start = *end = 0;
 			ClientConnectionLost = 1;
 			InterruptPending = 1;
 			return EOF;
@@ -1381,10 +1414,10 @@ internal_flush(void)
 
 		last_reported_send_errno = 0;	/* reset after any successful send */
 		bufptr += r;
-		PqSendStart += r;
+		*start += r;
 	}
 
-	PqSendStart = PqSendPointer = 0;
+	*start = *end = 0;
 	return 0;
 }
 
-- 
2.34.1

#36Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Melih Mutlu (#35)
Re: Flushing large data immediately in pqcomm

On Thu, 4 Apr 2024 at 13:08, Melih Mutlu <m.melihmutlu@gmail.com> wrote:

I changed internal_flush() to an inline function, results look better this way.

It seems you also changed internal_flush_buffer to be inline (but only
in the actual function definition, not declaration at the top). I
don't think inlining internal_flush_buffer should be necessary to
avoid the perf regressions, i.e. internal_flush is adding extra
indirection compared to master and is only a single line, so that one
makes sense to inline.

Other than that the code looks good to me.

The new results look great.

One thing that is quite interesting about these results is that
increasing the buffer size results in even better performance (by
quite a bit). I don't think we can easily choose a perfect number, as
it seems to be a trade-off between memory usage and perf. But allowing
people to configure it through a GUC like in your second patchset
would be quite useful I think, especially because larger buffers could
be configured for connections that would benefit most from it (e.g.
replication connections or big COPYs).

I think your add-pq_send_buffer_size-GUC.patch is essentially what we
would need there but it would need some extra changes to actually be
merge-able:
1. needs docs
2. rename PQ_SEND_BUFFER_SIZE (at least make it not UPPER_CASE, but
maybe also remove the PQ_ prefix)
3. It's marked as PGC_USERSET, but there's no logic to grow/shrink it
after initial allocation (a rough sketch of such resize logic is below)
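
To sketch point 3 (purely illustrative; the hook name and the resize rule
are my assumptions, not code from the add-pq_send_buffer_size patch):

	static void
	assign_pq_send_buffer_size(int newval, void *extra)
	{
		/*
		 * Hypothetical GUC assign hook: resize the send buffer when the
		 * setting changes, but only while nothing is waiting to be sent;
		 * otherwise keep the old buffer for now.
		 */
		if (PqSendBuffer != NULL &&
			newval != PqSendBufferSize &&
			PqSendStart == PqSendPointer)
		{
			PqSendBuffer = repalloc(PqSendBuffer, newval);
			PqSendBufferSize = newval;
			PqSendStart = PqSendPointer = 0;
		}
	}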

#37Melih Mutlu
m.melihmutlu@gmail.com
In reply to: Jelte Fennema-Nio (#36)
1 attachment(s)
Re: Flushing large data immediately in pqcomm

Jelte Fennema-Nio <postgres@jeltef.nl>, 4 Nis 2024 Per, 16:34 tarihinde
şunu yazdı:

On Thu, 4 Apr 2024 at 13:08, Melih Mutlu <m.melihmutlu@gmail.com> wrote:

I changed internal_flush() to an inline function, results look better

this way.

It seems you also change internal_flush_buffer to be inline (but only
in the actual function definition, not declaration at the top). I
don't think inlining internal_flush_buffer should be necessary to
avoid the perf regressions, i.e. internal_flush is adding extra
indirection compared to master and is only a single line, so that one
makes sense to inline.

Right. It was a mistake, forgot to remove that. Fixed it in v5.

Other than that the code looks good to me.

The new results look great.

One thing that is quite interesting about these results is that
increasing the buffer size results in even better performance (by
quite a bit). I don't think we can easily choose a perfect number, as
it seems to be a trade-off between memory usage and perf. But allowing
people to configure it through a GUC like in your second patchset
would be quite useful I think, especially because larger buffers could
be configured for connections that would benefit most for it (e.g.
replication connections or big COPYs).

I think your add-pq_send_buffer_size-GUC.patch is essentially what we
would need there but it would need some extra changes to actually be
merge-able:
1. needs docs
2. rename PQ_SEND_BUFFER_SIZE (at least make it not UPPER_CASE, but
maybe also remove the PQ_ prefix)
3. It's marked as PGC_USERSET, but there's no logic to grow/shrink it
after initial allocation

I agree that the GUC patch requires more work to be in good shape. I
created that for testing purposes. But if we decide to make the buffer size
customizable, then I'll start polishing up that patch and address your
suggestions.

One case that could benefit from increased COPY performance is table sync
of logical replication. It might make sense letting users to configure
buffer size to speed up table sync. I'm not sure what kind of problems this
GUC would bring though.

Thanks,
--
Melih Mutlu
Microsoft

Attachments:

v5-0001-Flush-large-data-immediately-in-pqcomm.patch
From 1334c5f529d4d70d1eb99c190b605ddc8dd87e64 Mon Sep 17 00:00:00 2001
From: Melih Mutlu <m.melihmutlu@gmail.com>
Date: Mon, 20 Nov 2023 11:20:52 +0300
Subject: [PATCH v5] Flush large data immediately in pqcomm

If the data is larger than send buffer size in pqcomm, we're sure that
the send buffer will be flushed at least once to fit the data depending
on how large the data is.

Instead of repeatedly calling memcpy and then flushing data larger than
the available space in the send buffer, this patch changes
internal_putbytes() logic to flush large data immediately if the buffer
is empty without unnecessarily copying it into the pqcomm send buffer.
---
 src/backend/libpq/pqcomm.c | 65 ++++++++++++++++++++++++++++----------
 1 file changed, 49 insertions(+), 16 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 6497100a1a..f18ddd5660 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -120,8 +120,8 @@ static List *sock_paths = NIL;
 
 static char *PqSendBuffer;
 static int	PqSendBufferSize;	/* Size send buffer */
-static int	PqSendPointer;		/* Next index to store a byte in PqSendBuffer */
-static int	PqSendStart;		/* Next index to send a byte in PqSendBuffer */
+static size_t PqSendPointer;		/* Next index to store a byte in PqSendBuffer */
+static size_t PqSendStart;		/* Next index to send a byte in PqSendBuffer */
 
 static char PqRecvBuffer[PQ_RECV_BUFFER_SIZE];
 static int	PqRecvPointer;		/* Next index to read a byte from PqRecvBuffer */
@@ -144,7 +144,8 @@ static bool socket_is_send_pending(void);
 static int	socket_putmessage(char msgtype, const char *s, size_t len);
 static void socket_putmessage_noblock(char msgtype, const char *s, size_t len);
 static int	internal_putbytes(const char *s, size_t len);
-static int	internal_flush(void);
+static inline int internal_flush(void);
+static int	internal_flush_buffer(const char *s, size_t *start, size_t *end);
 
 static int	Lock_AF_UNIX(const char *unixSocketDir, const char *unixSocketPath);
 static int	Setup_AF_UNIX(const char *sock_path);
@@ -1282,14 +1283,32 @@ internal_putbytes(const char *s, size_t len)
 			if (internal_flush())
 				return EOF;
 		}
-		amount = PqSendBufferSize - PqSendPointer;
-		if (amount > len)
-			amount = len;
-		memcpy(PqSendBuffer + PqSendPointer, s, amount);
-		PqSendPointer += amount;
-		s += amount;
-		len -= amount;
+
+		/*
+		 * If the buffer is empty and data length is larger than the buffer
+		 * size, send it without buffering. Otherwise, put as much data as
+		 * possible into the buffer.
+		 */
+		if (len >= PqSendBufferSize && PqSendStart == PqSendPointer)
+		{
+			size_t start = 0;
+
+			socket_set_nonblocking(false);
+			if (internal_flush_buffer(s, &start, &len))
+				return EOF;
+		}
+		else
+		{
+			amount = PqSendBufferSize - PqSendPointer;
+			if (amount > len)
+				amount = len;
+			memcpy(PqSendBuffer + PqSendPointer, s, amount);
+			PqSendPointer += amount;
+			s += amount;
+			len -= amount;
+		}
 	}
+
 	return 0;
 }
 
@@ -1321,13 +1340,27 @@ socket_flush(void)
  * and the socket is in non-blocking mode), or EOF if trouble.
  * --------------------------------
  */
-static int
+static inline int
 internal_flush(void)
+{
+	/* flush the pending output from send buffer. */
+	return internal_flush_buffer(PqSendBuffer, &PqSendStart, &PqSendPointer);
+}
+
+/* --------------------------------
+ *		internal_flush_buffer - flush the given buffer content
+ *
+ * Returns 0 if OK (meaning everything was sent, or operation would block
+ * and the socket is in non-blocking mode), or EOF if trouble.
+ * --------------------------------
+ */
+static int
+internal_flush_buffer(const char *s, size_t *start, size_t *end)
 {
 	static int	last_reported_send_errno = 0;
 
-	char	   *bufptr = PqSendBuffer + PqSendStart;
-	char	   *bufend = PqSendBuffer + PqSendPointer;
+	char	   *bufptr = (char*) s + *start;
+	char	   *bufend = (char*) s + *end;
 
 	while (bufptr < bufend)
 	{
@@ -1373,7 +1406,7 @@ internal_flush(void)
 			 * flag that'll cause the next CHECK_FOR_INTERRUPTS to terminate
 			 * the connection.
 			 */
-			PqSendStart = PqSendPointer = 0;
+			*start = *end = 0;
 			ClientConnectionLost = 1;
 			InterruptPending = 1;
 			return EOF;
@@ -1381,10 +1414,10 @@ internal_flush(void)
 
 		last_reported_send_errno = 0;	/* reset after any successful send */
 		bufptr += r;
-		PqSendStart += r;
+		*start += r;
 	}
 
-	PqSendStart = PqSendPointer = 0;
+	*start = *end = 0;
 	return 0;
 }
 
-- 
2.34.1

#38David Rowley
dgrowleyml@gmail.com
In reply to: Melih Mutlu (#37)
2 attachment(s)
Re: Flushing large data immediately in pqcomm

On Fri, 5 Apr 2024 at 03:28, Melih Mutlu <m.melihmutlu@gmail.com> wrote:

Jelte Fennema-Nio <postgres@jeltef.nl>, 4 Nis 2024 Per, 16:34 tarihinde şunu yazdı:

On Thu, 4 Apr 2024 at 13:08, Melih Mutlu <m.melihmutlu@gmail.com> wrote:

I changed internal_flush() to an inline function, results look better this way.

It seems you also change internal_flush_buffer to be inline (but only
in the actual function definition, not declaration at the top). I
don't think inlining internal_flush_buffer should be necessary to
avoid the perf regressions, i.e. internal_flush is adding extra
indirection compared to master and is only a single line, so that one
makes sense to inline.

Right. It was a mistake, forgot to remove that. Fixed it in v5.

I don't see any issues with v5, so based on the performance numbers
shown on this thread for the latest patch, it would make sense to push
it. The problem is, I just can't recreate the performance numbers.

I've tried both on my AMD 3990x machine and an Apple M2 with a script
similar to the test.sh from above. I mostly just stripped out the
buffer size stuff and adjusted the timing code to something that would
work with mac.

The script runs each copy 30 times and takes the average time,
reported here in seconds.

With AMD 3990x:

master
Run 100 100 5000000: 1.032264113 sec
Run 1024 10240 200000: 1.016229105 sec
Run 1024 1048576 2000: 1.242267116 sec
Run 1048576 1048576 1000: 1.245425089 sec

v5
Run 100 100 5000000: 1.068543053 sec
Run 1024 10240 200000: 1.026298571 sec
Run 1024 1048576 2000: 1.231169669 sec
Run 1048576 1048576 1000: 1.236355567 sec

With the M2 mini:

master
Run 100 100 5000000: 1.167851249 sec
Run 1024 10240 200000: 1.962466987 sec
Run 1024 1048576 2000: 2.052836275 sec
Run 1048576 1048576 1000: 2.057908066 sec

v5
Run 100 100 5000000: 1.149636571 sec
Run 1024 10240 200000: 2.158487741 sec
Run 1024 1048576 2000: 2.046627068 sec
Run 1048576 1048576 1000: 2.039329068 sec

From looking at the perf reports, the top function is:

57.62% postgres [.] CopyAttributeOutText

I messed around with trying to speed up the string escaping in that
function with the attached hacky patch and got the following on the
AMD 3990x machine:

CopyAttributeOutText_speedup.patch.txt
Run 100 100 5000000: 0.821673910
Run 1024 10240 200000: 0.546632147
Run 1024 1048576 2000: 0.848492694
Run 1048576 1048576 1000: 0.840870293

I don't think we could actually do this unless we modified the output
function API to have it somehow output the number of bytes. The patch
may look beyond the NUL byte with pg_lfind8, which I don't think is
safe.

Does anyone else want to try the attached script on the v5 patch to
see if their numbers are better?

David

Attachments:

test1.sh.txt
CopyAttributeOutText_speedup.patch.txt
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index ae8b2e36d7..fcb200503f 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -29,6 +29,7 @@
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "port/pg_lfind.h"
 #include "storage/fd.h"
 #include "tcop/tcopprot.h"
 #include "utils/lsyscache.h"
@@ -1068,9 +1069,18 @@ CopyAttributeOutText(CopyToState cstate, const char *string)
 	else
 	{
 		start = ptr;
-		while ((c = *ptr) != '\0')
+
+		while (true)
 		{
-			if ((unsigned char) c < (unsigned char) 0x20)
+			while (!pg_lfind8('\\', (uint8 *) ptr, sizeof(Vector8)) &&
+				   !pg_lfind8(delimc, (uint8 *) ptr, sizeof(Vector8)) &&
+				   !pg_lfind8_le(31, (uint8 *) ptr, sizeof(Vector8)))
+				ptr += sizeof(Vector8);
+
+			if ((c = *ptr) == '\0')
+				break;
+
+			if ((unsigned char) c <= (unsigned char) 31)
 			{
 				/*
 				 * \r and \n must be escaped, the others are traditional. We
#39Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: David Rowley (#38)
Re: Flushing large data immediately in pqcomm

On Sat, 6 Apr 2024 at 03:34, David Rowley <dgrowleyml@gmail.com> wrote:

Does anyone else want to try the attached script on the v5 patch to
see if their numbers are better?

On my machine (i9-10900X, in Ubuntu 22.04 on WSL on Windows) v5
consistently beats master by ~0.25 seconds:

master:
Run 100 100 5000000: 1.948975205
Run 1024 10240 200000: 3.039986587
Run 1024 1048576 2000: 2.444176276
Run 1048576 1048576 1000: 2.475328596

v5:
Run 100 100 5000000: 1.997170909
Run 1024 10240 200000: 3.057802598
Run 1024 1048576 2000: 2.199449857
Run 1048576 1048576 1000: 2.210328762

The first two runs are pretty much equal, and I ran your script a few
more times and this seems like just random variance (sometimes v5 wins
those, sometimes master does, always quite close to each other). But
the last two runs v5 consistently wins.

Weird that on your machines you don't see a difference. Are you sure
you didn't make a silly mistake, like not restarting postgres or
something?

#40David Rowley
dgrowleyml@gmail.com
In reply to: Jelte Fennema-Nio (#39)
1 attachment(s)
Re: Flushing large data immediately in pqcomm

On Sat, 6 Apr 2024 at 23:17, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

Weird that on your machines you don't see a difference. Are you sure
you didn't make a silly mistake, like not restarting postgres or
something?

I'm sure. I spent quite a long time between the AMD and an Apple m2 trying.

I did see the same regression as you on the smaller numbers. I
experimented with the attached, which macro'ifies internal_flush() and
pg_noinlines internal_flush_buffer.

Can you try that to see if it gets rid of the regression on the first two tests?

David

Attachments:

flush_macro_noinline.patch
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 6497100a1a..824b2f11a3 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -120,8 +120,8 @@ static List *sock_paths = NIL;
 
 static char *PqSendBuffer;
 static int	PqSendBufferSize;	/* Size send buffer */
-static int	PqSendPointer;		/* Next index to store a byte in PqSendBuffer */
-static int	PqSendStart;		/* Next index to send a byte in PqSendBuffer */
+static size_t PqSendPointer;		/* Next index to store a byte in PqSendBuffer */
+static size_t PqSendStart;		/* Next index to send a byte in PqSendBuffer */
 
 static char PqRecvBuffer[PQ_RECV_BUFFER_SIZE];
 static int	PqRecvPointer;		/* Next index to read a byte from PqRecvBuffer */
@@ -133,6 +133,7 @@ static int	PqRecvLength;		/* End of data available in PqRecvBuffer */
 static bool PqCommBusy;			/* busy sending data to the client */
 static bool PqCommReadingMsg;	/* in the middle of reading a message */
 
+#define internal_flush()	internal_flush_buffer(PqSendBuffer, &PqSendStart, &PqSendPointer)
 
 /* Internal functions */
 static void socket_comm_reset(void);
@@ -144,7 +145,8 @@ static bool socket_is_send_pending(void);
 static int	socket_putmessage(char msgtype, const char *s, size_t len);
 static void socket_putmessage_noblock(char msgtype, const char *s, size_t len);
 static int	internal_putbytes(const char *s, size_t len);
-static int	internal_flush(void);
+static pg_noinline int internal_flush_buffer(const char *s, size_t *start,
+											 size_t *end);
 
 static int	Lock_AF_UNIX(const char *unixSocketDir, const char *unixSocketPath);
 static int	Setup_AF_UNIX(const char *sock_path);
@@ -1282,14 +1284,32 @@ internal_putbytes(const char *s, size_t len)
 			if (internal_flush())
 				return EOF;
 		}
-		amount = PqSendBufferSize - PqSendPointer;
-		if (amount > len)
-			amount = len;
-		memcpy(PqSendBuffer + PqSendPointer, s, amount);
-		PqSendPointer += amount;
-		s += amount;
-		len -= amount;
+
+		/*
+		 * If the buffer is empty and data length is larger than the buffer
+		 * size, send it without buffering. Otherwise, put as much data as
+		 * possible into the buffer.
+		 */
+		if (len >= PqSendBufferSize && PqSendStart == PqSendPointer)
+		{
+			size_t start = 0;
+
+			socket_set_nonblocking(false);
+			if (internal_flush_buffer(s, &start, &len))
+				return EOF;
+		}
+		else
+		{
+			amount = PqSendBufferSize - PqSendPointer;
+			if (amount > len)
+				amount = len;
+			memcpy(PqSendBuffer + PqSendPointer, s, amount);
+			PqSendPointer += amount;
+			s += amount;
+			len -= amount;
+		}
 	}
+
 	return 0;
 }
 
@@ -1315,19 +1335,19 @@ socket_flush(void)
 }
 
 /* --------------------------------
- *		internal_flush - flush pending output
+ *		internal_flush_buffer - flush the given buffer content
  *
  * Returns 0 if OK (meaning everything was sent, or operation would block
  * and the socket is in non-blocking mode), or EOF if trouble.
  * --------------------------------
  */
-static int
-internal_flush(void)
+static pg_noinline int
+internal_flush_buffer(const char *s, size_t *start, size_t *end)
 {
 	static int	last_reported_send_errno = 0;
 
-	char	   *bufptr = PqSendBuffer + PqSendStart;
-	char	   *bufend = PqSendBuffer + PqSendPointer;
+	char	   *bufptr = (char*) s + *start;
+	char	   *bufend = (char*) s + *end;
 
 	while (bufptr < bufend)
 	{
@@ -1373,7 +1393,7 @@ internal_flush(void)
 			 * flag that'll cause the next CHECK_FOR_INTERRUPTS to terminate
 			 * the connection.
 			 */
-			PqSendStart = PqSendPointer = 0;
+			*start = *end = 0;
 			ClientConnectionLost = 1;
 			InterruptPending = 1;
 			return EOF;
@@ -1381,10 +1401,10 @@ internal_flush(void)
 
 		last_reported_send_errno = 0;	/* reset after any successful send */
 		bufptr += r;
-		PqSendStart += r;
+		*start += r;
 	}
 
-	PqSendStart = PqSendPointer = 0;
+	*start = *end = 0;
 	return 0;
 }
 
#41Andres Freund
andres@anarazel.de
In reply to: David Rowley (#38)
1 attachment(s)
Re: Flushing large data immediately in pqcomm

Hi,

On 2024-04-06 14:34:17 +1300, David Rowley wrote:

I don't see any issues with v5, so based on the performance numbers
shown on this thread for the latest patch, it would make sense to push
it. The problem is, I just can't recreate the performance numbers.

I've tried both on my AMD 3990x machine and an Apple M2 with a script
similar to the test.sh from above. I mostly just stripped out the
buffer size stuff and adjusted the timing code to something that would
work with mac.

I think there are a few issues with the test script that explain why you're
not seeing a gain:

1) I think using the textual protocol, with the text datatype, will make it
harder to spot differences. That's a lot of overhead.

2) Afaict the test is connecting over the Unix socket; I think we expect
bigger wins for TCP.

3) The larger string in particular is bottlenecked by pglz compression in
TOAST.

Where I had noticed the overhead of the current approach most was in streaming
out base backups, which is all binary, of course.

I added WITH BINARY, SET STORAGE EXTERNAL and tested both the Unix socket and
localhost. I also reduced row counts and iteration counts, because I am
impatient, and I don't think it matters much here. Attached is the modified
version.

On a dual Xeon Gold 5215, turbo boost disabled, server pinned to one core,
script pinned to another:

unix:

master:
Run 100 100 1000000: 0.058482377
Run 1024 10240 100000: 0.120909810
Run 1024 1048576 2000: 0.153027916
Run 1048576 1048576 1000: 0.154953512

v5:
Run 100 100 1000000: 0.058760126
Run 1024 10240 100000: 0.118831396
Run 1024 1048576 2000: 0.124282503
Run 1048576 1048576 1000: 0.123894962

localhost:

master:
Run 100 100 1000000: 0.067088000
Run 1024 10240 100000: 0.170894273
Run 1024 1048576 2000: 0.230346632
Run 1048576 1048576 1000: 0.230336078

v5:
Run 100 100 1000000: 0.067144036
Run 1024 10240 100000: 0.167950948
Run 1024 1048576 2000: 0.135167027
Run 1048576 1048576 1000: 0.135347867

The perf difference for 1MB via TCP is really impressive.
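For the 1MB rows that is roughly 0.230s down to 0.135s per run over
localhost, i.e. about a 40% reduction in wall-clock time, or ~1.7x faster.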

The small regression for small results is still kinda visible, I haven't yet
tested the patch downthread.

Greetings,

Andres Freund

Attachments:

test1a.sh.txt
#42Ranier Vilela
ranier.vf@gmail.com
In reply to: Andres Freund (#41)
Re: Flushing large data immediately in pqcomm

Hi,

On Fri, 5 Apr 2024 at 03:28, Melih Mutlu
<m(dot)melihmutlu(at)gmail(dot)com> wrote:

Right. It was a mistake, forgot to remove that. Fixed it in v5.

If you don't mind, I have some suggestions for patch v5.

1. Shouldn't PqSendBufferSize be of type size_t?
There are several comparisons with other size_t variables.
static size_t PqSendBufferSize; /* Size send buffer */

I think this would prevent possible overflows.

2. If PqSendBufferSize is changed to size_t, then in the function
socket_putmessage_noblock the variable named *required* should be changed
to size_t as well.

static void
socket_putmessage_noblock(char msgtype, const char *s, size_t len)
{
    int     res PG_USED_FOR_ASSERTS_ONLY;
    size_t  required;

3. In the internal_putbytes function, the scope of the *amount* variable
could safely be reduced.

else
{
    size_t  amount;

    amount = PqSendBufferSize - PqSendPointer;

4. In the function internal_flush_buffer, the variables named
*bufptr* and *bufend* could be of type const char *, like:

static int
internal_flush_buffer(const char *s, size_t *start, size_t *end)
{
    static int  last_reported_send_errno = 0;

    const char *bufptr = s + *start;
    const char *bufend = s + *end;

best regards,
Ranier Vilela

#43Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Andres Freund (#41)
1 attachment(s)
Re: Flushing large data immediately in pqcomm

On Sat, 6 Apr 2024 at 22:21, Andres Freund <andres@anarazel.de> wrote:

The small regression for small results is still kinda visible, I haven't yet
tested the patch downthread.

Thanks a lot for the faster test script, I'm also impatient. I still
saw the small regression with David's patch. Here's a v6 where I
think it is now gone. I added inline to internal_putbytes too. I
think that helped especially because for two calls to
internal_putbytes len is a constant (1 and 4) that is smaller than
PqSendBufferSize. So for those calls the compiler can now statically
eliminate the new codepath because "len >= PqSendBufferSize" is known
to be false at compile time.
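For context, the two constant-length calls are the message-type byte and the
length word in socket_putmessage(), roughly like this (paraphrasing pqcomm.c
from memory, not an exact copy):

    /* len is the constant 1 here */
    if (internal_putbytes(&msgtype, 1))
        goto fail;

    /* len is the constant 4 here; the length word counts itself */
    n32 = pg_hton32((uint32) (len + 4));
    if (internal_putbytes((char *) &n32, 4))
        goto fail;

    /* only the payload call passes a variable length */
    if (internal_putbytes(s, len))
        goto fail;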

I also incorporated all of Ranier's comments.

Attachments:

v6-0001-Faster-internal_putbytes.patch
From a25bd7013ba491ebc894f5c5dfba3a751ab501c0 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Date: Sat, 6 Apr 2024 19:19:28 +0200
Subject: [PATCH v6] Faster internal_putbytes

---
 src/backend/libpq/pqcomm.c | 69 ++++++++++++++++++++++++--------------
 1 file changed, 44 insertions(+), 25 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 6497100a1a4..df40ef0a5e3 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -119,9 +119,9 @@ static List *sock_paths = NIL;
 #define PQ_RECV_BUFFER_SIZE 8192
 
 static char *PqSendBuffer;
-static int	PqSendBufferSize;	/* Size send buffer */
-static int	PqSendPointer;		/* Next index to store a byte in PqSendBuffer */
-static int	PqSendStart;		/* Next index to send a byte in PqSendBuffer */
+static size_t PqSendBufferSize; /* Size send buffer */
+static size_t PqSendPointer;	/* Next index to store a byte in PqSendBuffer */
+static size_t PqSendStart;		/* Next index to send a byte in PqSendBuffer */
 
 static char PqRecvBuffer[PQ_RECV_BUFFER_SIZE];
 static int	PqRecvPointer;		/* Next index to read a byte from PqRecvBuffer */
@@ -133,6 +133,7 @@ static int	PqRecvLength;		/* End of data available in PqRecvBuffer */
 static bool PqCommBusy;			/* busy sending data to the client */
 static bool PqCommReadingMsg;	/* in the middle of reading a message */
 
+#define internal_flush()	internal_flush_buffer(PqSendBuffer, &PqSendStart, &PqSendPointer)
 
 /* Internal functions */
 static void socket_comm_reset(void);
@@ -143,8 +144,9 @@ static int	socket_flush_if_writable(void);
 static bool socket_is_send_pending(void);
 static int	socket_putmessage(char msgtype, const char *s, size_t len);
 static void socket_putmessage_noblock(char msgtype, const char *s, size_t len);
-static int	internal_putbytes(const char *s, size_t len);
-static int	internal_flush(void);
+static inline int internal_putbytes(const char *s, size_t len);
+static pg_noinline int internal_flush_buffer(const char *s, size_t *start,
+											 size_t *end);
 
 static int	Lock_AF_UNIX(const char *unixSocketDir, const char *unixSocketPath);
 static int	Setup_AF_UNIX(const char *sock_path);
@@ -1268,11 +1270,9 @@ pq_getmessage(StringInfo s, int maxlen)
 }
 
 
-static int
+static inline int
 internal_putbytes(const char *s, size_t len)
 {
-	size_t		amount;
-
 	while (len > 0)
 	{
 		/* If buffer is full, then flush it out */
@@ -1282,14 +1282,33 @@ internal_putbytes(const char *s, size_t len)
 			if (internal_flush())
 				return EOF;
 		}
-		amount = PqSendBufferSize - PqSendPointer;
-		if (amount > len)
-			amount = len;
-		memcpy(PqSendBuffer + PqSendPointer, s, amount);
-		PqSendPointer += amount;
-		s += amount;
-		len -= amount;
+
+		/*
+		 * If the buffer is empty and data length is larger than the buffer
+		 * size, send it without buffering. Otherwise, put as much data as
+		 * possible into the buffer.
+		 */
+		if (len >= PqSendBufferSize && PqSendStart == PqSendPointer)
+		{
+			size_t		start = 0;
+
+			socket_set_nonblocking(false);
+			if (internal_flush_buffer(s, &start, &len))
+				return EOF;
+		}
+		else
+		{
+			size_t		amount = PqSendBufferSize - PqSendPointer;
+
+			if (amount > len)
+				amount = len;
+			memcpy(PqSendBuffer + PqSendPointer, s, amount);
+			PqSendPointer += amount;
+			s += amount;
+			len -= amount;
+		}
 	}
+
 	return 0;
 }
 
@@ -1315,25 +1334,25 @@ socket_flush(void)
 }
 
 /* --------------------------------
- *		internal_flush - flush pending output
+ *		internal_flush_buffer - flush the given buffer content
  *
  * Returns 0 if OK (meaning everything was sent, or operation would block
  * and the socket is in non-blocking mode), or EOF if trouble.
  * --------------------------------
  */
-static int
-internal_flush(void)
+static pg_noinline int
+internal_flush_buffer(const char *s, size_t *start, size_t *end)
 {
 	static int	last_reported_send_errno = 0;
 
-	char	   *bufptr = PqSendBuffer + PqSendStart;
-	char	   *bufend = PqSendBuffer + PqSendPointer;
+	const char *bufptr = s + *start;
+	const char *bufend = s + *end;
 
 	while (bufptr < bufend)
 	{
 		int			r;
 
-		r = secure_write(MyProcPort, bufptr, bufend - bufptr);
+		r = secure_write(MyProcPort, (char *) bufptr, bufend - bufptr);
 
 		if (r <= 0)
 		{
@@ -1373,7 +1392,7 @@ internal_flush(void)
 			 * flag that'll cause the next CHECK_FOR_INTERRUPTS to terminate
 			 * the connection.
 			 */
-			PqSendStart = PqSendPointer = 0;
+			*start = *end = 0;
 			ClientConnectionLost = 1;
 			InterruptPending = 1;
 			return EOF;
@@ -1381,10 +1400,10 @@ internal_flush(void)
 
 		last_reported_send_errno = 0;	/* reset after any successful send */
 		bufptr += r;
-		PqSendStart += r;
+		*start += r;
 	}
 
-	PqSendStart = PqSendPointer = 0;
+	*start = *end = 0;
 	return 0;
 }
 
@@ -1487,7 +1506,7 @@ static void
 socket_putmessage_noblock(char msgtype, const char *s, size_t len)
 {
 	int			res PG_USED_FOR_ASSERTS_ONLY;
-	int			required;
+	size_t		required;
 
 	/*
 	 * Ensure we have enough space in the output buffer for the message header

base-commit: f956ecd0353b2960f8322b2211142113fe2b6f67
-- 
2.34.1

#44Andres Freund
andres@anarazel.de
In reply to: Jelte Fennema-Nio (#43)
Re: Flushing large data immediately in pqcomm

Hi,

On 2024-04-07 00:45:31 +0200, Jelte Fennema-Nio wrote:

On Sat, 6 Apr 2024 at 22:21, Andres Freund <andres@anarazel.de> wrote:

The small regression for small results is still kinda visible, I haven't yet
tested the patch downthread.

Thanks a lot for the faster test script, I'm also impatient. I still
saw the small regression with David's patch. Here's a v6 where I
think it is now gone. I added inline to internal_putbytes too. I
think that helped especially because for two calls to
internal_putbytes len is a constant (1 and 4) that is smaller than
PqSendBufferSize. So for those calls the compiler can now statically
eliminate the new codepath because "len >= PqSendBufferSize" is known
to be false at compile time.

Nice.

I also incorporated all of Ranier's comments.

Changing the global vars to size_t seems mildly bogus to me. All it's
achieving is to use slightly more memory. It also just seems unrelated to the
change.

Greetings,

Andres Freund

#45David Rowley
dgrowleyml@gmail.com
In reply to: Andres Freund (#41)
Re: Flushing large data immediately in pqcomm

On Sun, 7 Apr 2024 at 08:21, Andres Freund <andres@anarazel.de> wrote:

I added WITH BINARY, SET STORAGE EXTERNAL and tested both unix socket and
localhost. I also reduced row counts and iteration counts, because I am
impatient, and I don't think it matters much here. Attached the modified
version.

Thanks for the script. I'm able to reproduce the speedup with your script.

I looked over the patch again and ended up making internal_flush an
inline function rather than a macro. I compared the assembly produced
from each and it's the same with the exception of the label names
being different.

I've now pushed the patch.

One thing that does not seem ideal is having to cast away the
const-ness of the buffer in internal_flush_buffer(). Previously this
wasn't an issue as we always copied the buffer and passed that to
secure_write(). I wonder if it's worth seeing if we can keep this
buffer constant all the way to the socket write.

That seems to require modifying the following function signatures:
secure_write(), be_tls_write(), be_gssapi_write(). That's not an area
I'm familiar with, however.

David

#46Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Andres Freund (#44)
1 attachment(s)
Re: Flushing large data immediately in pqcomm

On Sun, 7 Apr 2024 at 03:39, Andres Freund <andres@anarazel.de> wrote:

Changing the global vars to size_t seems mildly bogus to me. All it's
achieving is to use slightly more memory. It also just seems unrelated to the
change.

I took a closer look at this. I agree that changing PqSendBufferSize
to size_t is unnecessary: given the locations where it is used, I see no
risk of overflow anywhere. Changing the type of PqSendPointer and
PqSendStart is needed though, because (as described by Heikki and David
upthread) the argument type of internal_flush_buffer is size_t*. So if
you actually pass int* there, and the sizes are not the same then you
will start writing out of bounds. And because internal_flush_buffer is
introduced in this patch, it is related to this change.

This is what David just committed too.
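To spell out the out-of-bounds hazard: on a typical 64-bit platform
sizeof(int) is 4 while sizeof(size_t) is 8, so a hypothetical call like the
one below (an illustration only, not code from the patch) would make
internal_flush_buffer() store 8 bytes through pointers to 4-byte objects when
it resets them:

    int     start = 0;      /* imagine PqSendStart had stayed an int */
    int     end = 0;        /* imagine PqSendPointer had stayed an int */

    internal_flush_buffer(PqSendBuffer, (size_t *) &start, (size_t *) &end);
    /* the "*start = *end = 0" inside writes past both ints */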

However, the "required" var actually should be of size_t to avoid
overflow if len is larger than int even without this change. So
attached is a tiny patch that does that.

Attachments:

v7-0001-Avoid-possible-overflow-in-socket_putmessage_nonb.patch
From e0320c575f572e7cf6cee8ed4e17016d5ba76901 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Date: Sun, 7 Apr 2024 10:31:28 +0200
Subject: [PATCH v7 1/2] Avoid possible overflow in socket_putmessage_nonblock

On systems where int consists of fewer bits than size_t there was a
possibility for overflow in socket_putmessage_nonblock, because of an
intermediate cast from size_t to int.
---
 src/backend/libpq/pqcomm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 6497100a1a4..14186a3e065 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1487,7 +1487,7 @@ static void
 socket_putmessage_noblock(char msgtype, const char *s, size_t len)
 {
 	int			res PG_USED_FOR_ASSERTS_ONLY;
-	int			required;
+	size_t		required;
 
 	/*
 	 * Ensure we have enough space in the output buffer for the message header

base-commit: a97bbe1f1df9eba0b18207c321c67de80b33db16
-- 
2.34.1

#47David Rowley
dgrowleyml@gmail.com
In reply to: Jelte Fennema-Nio (#46)
Re: Flushing large data immediately in pqcomm

On Sun, 7 Apr 2024 at 22:05, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

On Sun, 7 Apr 2024 at 03:39, Andres Freund <andres@anarazel.de> wrote:

Changing the global vars to size_t seems mildly bogus to me. All it's
achieving is to use slightly more memory. It also just seems unrelated to the
change.

I took a closer look at this. I agree that changing PqSendBufferSize
to size_t is unnecessary: given the locations where it is used, I see no
risk of overflow anywhere. Changing the type of PqSendPointer and
PqSendStart is needed though, because (as described by Heikki and David
upthread) the argument type of internal_flush_buffer is size_t*. So if
you actually pass int* there, and the sizes are not the same then you
will start writing out of bounds. And because internal_flush_buffer is
introduced in this patch, it is related to this change.

This is what David just committed too.

However, the "required" var actually should be of size_t to avoid
overflow if len is larger than int even without this change. So
attached is a tiny patch that does that.

Looking at the code in socket_putmessage_noblock(), I don't understand
why it's ok for PqSendBufferSize to be int but "required" must be
size_t. There's a line that does "PqSendBufferSize = required;". It
kinda looks like they both should be size_t. Am I missing something
that you've thought about?
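For reference, the code in question looks roughly like this with the v7 patch
applied (quoting from memory, not an exact copy of pqcomm.c):

    static int  PqSendBufferSize;   /* still an int at this point */
    ...
    size_t      required;

    /* make room for the message header plus the message itself */
    required = PqSendPointer + 1 + 4 + len;
    if (required > PqSendBufferSize)
    {
        PqSendBuffer = repalloc(PqSendBuffer, required);
        PqSendBufferSize = required;    /* size_t silently narrowed to int */
    }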

David

#48Ranier Vilela
ranier.vf@gmail.com
In reply to: Andres Freund (#44)
Re: Flushing large data immediately in pqcomm

On Sat, 6 Apr 2024 at 22:39, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2024-04-07 00:45:31 +0200, Jelte Fennema-Nio wrote:

On Sat, 6 Apr 2024 at 22:21, Andres Freund <andres@anarazel.de> wrote:

The small regression for small results is still kinda visible, I haven't yet
tested the patch downthread.

Thanks a lot for the faster test script, I'm also impatient. I still
saw the small regression with David's patch. Here's a v6 where I
think it is now gone. I added inline to internal_putbytes too. I
think that helped especially because for two calls to
internal_putbytes len is a constant (1 and 4) that is smaller than
PqSendBufferSize. So for those calls the compiler can now statically
eliminate the new codepath because "len >= PqSendBufferSize" is known
to be false at compile time.

Nice.

I also incorporated all of Ranier's comments.

Changing the global vars to size_t seems mildly bogus to me. All it's
achieving is to use slightly more memory. It also just seems unrelated to the
change.

I don't agree with this thought.
True, size_t uses 4 more bytes of memory than int on typical 64-bit platforms.
But mixing up int and size_t is a sure way to write non-portable code.
And the compilers will start showing messages such as "signed/unsigned
mismatch".

The global vars PqSendPointer and PqSendStart were changed in the v5 patch,
so for the sake of style and consistency, I understand that it is better
not to mix the types.

The compiler will promote PqSendBufferSize to size_t in all comparisons.

And finally the correct type to deal with char * variables is size_t.

Best regards,
Ranier Vilela


Greetings,

Andres Freund

#49Melih Mutlu
m.melihmutlu@gmail.com
In reply to: David Rowley (#38)
Re: Flushing large data immediately in pqcomm

David Rowley <dgrowleyml@gmail.com>, 6 Nis 2024 Cmt, 04:34 tarihinde şunu
yazdı:

Does anyone else want to try the attached script on the v5 patch to
see if their numbers are better?

I'm seeing the below results with your script on my machine. I ran it
several times; the results were nearly identical each time.

master:
Run 100 100 5000000: 1.627905512
Run 1024 10240 200000: 1.603231684
Run 1024 1048576 2000: 2.962812352
Run 1048576 1048576 1000: 2.940766748

v5:
Run 100 100 5000000: 1.611508155
Run 1024 10240 200000: 1.603505596
Run 1024 1048576 2000: 2.727241937
Run 1048576 1048576 1000: 2.721268988

#50Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: David Rowley (#47)
1 attachment(s)
Re: Flushing large data immediately in pqcomm

On Sun, 7 Apr 2024 at 14:41, David Rowley <dgrowleyml@gmail.com> wrote:

Looking at the code in socket_putmessage_noblock(), I don't understand
why it's ok for PqSendBufferSize to be int but "required" must be
size_t. There's a line that does "PqSendBufferSize = required;". It
kinda looks like they both should be size_t. Am I missing something
that you've thought about?

You and Ranier are totally right (I missed this assignment). Attached is v8.

Attachments:

v8-0001-Make-a-few-variables-size_t-in-pqcomm.c.patch
From def294980e01ca412f83fdfcd0b37d27726f2c0f Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Date: Mon, 8 Apr 2024 12:40:01 +0200
Subject: [PATCH v8] Make a few variables size_t in pqcomm.c

This is needed since c4ab7da6061 to avoid overflows in case int is
smaller than size_t.

Reported-By: Ranier Vilela
---
 src/backend/libpq/pqcomm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 2cee49a2085..af43eef2eee 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -119,7 +119,7 @@ static List *sock_paths = NIL;
 #define PQ_RECV_BUFFER_SIZE 8192
 
 static char *PqSendBuffer;
-static int	PqSendBufferSize;	/* Size send buffer */
+static size_t PqSendBufferSize; /* Size send buffer */
 static size_t PqSendPointer;	/* Next index to store a byte in PqSendBuffer */
 static size_t PqSendStart;		/* Next index to send a byte in PqSendBuffer */
 
@@ -1521,7 +1521,7 @@ static void
 socket_putmessage_noblock(char msgtype, const char *s, size_t len)
 {
 	int			res PG_USED_FOR_ASSERTS_ONLY;
-	int			required;
+	size_t		required;
 
 	/*
 	 * Ensure we have enough space in the output buffer for the message header

base-commit: 422041542f313f23ca66cad26e9b2b99c4d1999a
-- 
2.34.1

#51Ranier Vilela
ranier.vf@gmail.com
In reply to: Jelte Fennema-Nio (#50)
Re: Flushing large data immediately in pqcomm

On Mon, 8 Apr 2024 at 07:42, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

On Sun, 7 Apr 2024 at 14:41, David Rowley <dgrowleyml@gmail.com> wrote:

Looking at the code in socket_putmessage_noblock(), I don't understand
why it's ok for PqSendBufferSize to be int but "required" must be
size_t. There's a line that does "PqSendBufferSize = required;". It
kinda looks like they both should be size_t. Am I missing something
that you've thought about?

You and Ranier are totally right (I missed this assignment). Attached is
v8.

+1
LGTM.

best regards,
Ranier Vilela

#52Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#50)
3 attachment(s)
Re: Flushing large data immediately in pqcomm

On Sun, 7 Apr 2024 at 11:34, David Rowley <dgrowleyml@gmail.com> wrote:

That seems to require modifying the following function signatures:
secure_write(), be_tls_write(), be_gssapi_write(). That's not an area
I'm familiar with, however.

Attached is a new patchset where 0003 does exactly that. The only
place where we need to cast to non-const is for GSS, but that seems
fine (commit message has more details).
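To make the GSS point concrete, the cast ends up having roughly the following
shape (a sketch only, not the exact be-secure-gssapi.c code):

    gss_buffer_desc input;

    input.length = len;
    /* gss_buffer_desc.value is a plain "void *" even for pure input
     * buffers, so the const has to be cast away; unconstify() is the
     * usual helper for that */
    input.value = unconstify(void *, ptr);
    /* ... gss_wrap() then only reads from "input" ... */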

I also added patch 0002, which is a small addition to the function
comment of internal_flush_buffer that seemed useful to me to
differentiate it from internal_flush (feel free to ignore/rewrite).

Attachments:

v9-0003-Make-backend-libpq-write-functions-take-const-poi.patch
From e29e44cd9504927458295f694cf4295d073f955e Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Date: Mon, 8 Apr 2024 14:10:09 +0200
Subject: [PATCH v9 3/3] Make backend libpq write functions take const pointer

Since c4ab7da6061 we're now passing buffers that shouldn't be modified
to secure_write. This marks all involved functions as taking const
pointers to indicate that that is fine. We were not changing the
contents of this buffer anyway.

The only place where we need to cast it to a non-const pointer is in
be_gssapi_write, but that simply seems an artifact of the GSS API. GSS
has its own buffer type, which is used both for const and non-const
buffers.
---
 src/backend/libpq/be-secure-gssapi.c  | 2 +-
 src/backend/libpq/be-secure-openssl.c | 2 +-
 src/backend/libpq/be-secure.c         | 2 +-
 src/backend/libpq/pqcomm.c            | 2 +-
 src/include/libpq/libpq-be.h          | 4 ++--
 src/include/libpq/libpq.h             | 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/backend/libpq/be-secure-gssapi.c b/src/backend/libpq/be-secure-gssapi.c
index bc04e78abba..809fe8b959b 100644
--- a/src/backend/libpq/be-secure-gssapi.c
+++ b/src/backend/libpq/be-secure-gssapi.c
@@ -92,7 +92,7 @@ static uint32 PqGSSMaxPktSize;	/* Maximum size we can encrypt and fit the
  * failure if necessary, and then return an errno indicating connection loss.
  */
 ssize_t
-be_gssapi_write(Port *port, void *ptr, size_t len)
+be_gssapi_write(Port *port, const void *ptr, size_t len)
 {
 	OM_uint32	major,
 				minor;
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 29c9af1aabf..44ba6d8300c 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -799,7 +799,7 @@ be_tls_read(Port *port, void *ptr, size_t len, int *waitfor)
 }
 
 ssize_t
-be_tls_write(Port *port, void *ptr, size_t len, int *waitfor)
+be_tls_write(Port *port, const void *ptr, size_t len, int *waitfor)
 {
 	ssize_t		n;
 	int			err;
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index 1663f36b6b8..edf36139fe3 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -298,7 +298,7 @@ secure_raw_read(Port *port, void *ptr, size_t len)
  *	Write data to a secure connection.
  */
 ssize_t
-secure_write(Port *port, void *ptr, size_t len)
+secure_write(Port *port, const void *ptr, size_t len)
 {
 	ssize_t		n;
 	int			waitfor;
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index a3e80281196..7d3df13f532 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1370,7 +1370,7 @@ internal_flush_buffer(const char *buf, size_t *start, size_t *end)
 	{
 		int			r;
 
-		r = secure_write(MyProcPort, (char *) bufptr, bufend - bufptr);
+		r = secure_write(MyProcPort, bufptr, bufend - bufptr);
 
 		if (r <= 0)
 		{
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 05cb1874c58..7f3e9118800 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -297,7 +297,7 @@ extern ssize_t be_tls_read(Port *port, void *ptr, size_t len, int *waitfor);
 /*
  * Write data to a secure connection.
  */
-extern ssize_t be_tls_write(Port *port, void *ptr, size_t len, int *waitfor);
+extern ssize_t be_tls_write(Port *port, const void *ptr, size_t len, int *waitfor);
 
 /*
  * Return information about the SSL connection.
@@ -337,7 +337,7 @@ extern bool be_gssapi_get_delegation(Port *port);
 
 /* Read and write to a GSSAPI-encrypted connection. */
 extern ssize_t be_gssapi_read(Port *port, void *ptr, size_t len);
-extern ssize_t be_gssapi_write(Port *port, void *ptr, size_t len);
+extern ssize_t be_gssapi_write(Port *port, const void *ptr, size_t len);
 #endif							/* ENABLE_GSS */
 
 extern PGDLLIMPORT ProtocolVersion FrontendProtocol;
diff --git a/src/include/libpq/libpq.h b/src/include/libpq/libpq.h
index 83e338f604a..d1fab83f8c0 100644
--- a/src/include/libpq/libpq.h
+++ b/src/include/libpq/libpq.h
@@ -105,7 +105,7 @@ extern void secure_destroy(void);
 extern int	secure_open_server(Port *port);
 extern void secure_close(Port *port);
 extern ssize_t secure_read(Port *port, void *ptr, size_t len);
-extern ssize_t secure_write(Port *port, void *ptr, size_t len);
+extern ssize_t secure_write(Port *port, const void *ptr, size_t len);
 extern ssize_t secure_raw_read(Port *port, void *ptr, size_t len);
 extern ssize_t secure_raw_write(Port *port, const void *ptr, size_t len);
 
-- 
2.34.1

v9-0001-Make-a-few-variables-size_t-in-pqcomm.c.patch
From def294980e01ca412f83fdfcd0b37d27726f2c0f Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Date: Mon, 8 Apr 2024 12:40:01 +0200
Subject: [PATCH v9 1/3] Make a few variables size_t in pqcomm.c

This is needed since c4ab7da6061 to avoid overflows in case int is
smaller than size_t.

Reported-By: Ranier Vilela
---
 src/backend/libpq/pqcomm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 2cee49a2085..af43eef2eee 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -119,7 +119,7 @@ static List *sock_paths = NIL;
 #define PQ_RECV_BUFFER_SIZE 8192
 
 static char *PqSendBuffer;
-static int	PqSendBufferSize;	/* Size send buffer */
+static size_t PqSendBufferSize; /* Size send buffer */
 static size_t PqSendPointer;	/* Next index to store a byte in PqSendBuffer */
 static size_t PqSendStart;		/* Next index to send a byte in PqSendBuffer */
 
@@ -1521,7 +1521,7 @@ static void
 socket_putmessage_noblock(char msgtype, const char *s, size_t len)
 {
 	int			res PG_USED_FOR_ASSERTS_ONLY;
-	int			required;
+	size_t		required;
 
 	/*
 	 * Ensure we have enough space in the output buffer for the message header

base-commit: 422041542f313f23ca66cad26e9b2b99c4d1999a
-- 
2.34.1

v9-0002-Expand-comment-of-internal_flush_buffer.patch
From d32640d7500714e755b1b2caeeae06e8a565ce8c Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
Date: Mon, 8 Apr 2024 13:27:35 +0200
Subject: [PATCH v9 2/3] Expand comment of internal_flush_buffer

It didn't explain that start and end would be updated, which seems
fairly important.
---
 src/backend/libpq/pqcomm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index af43eef2eee..a3e80281196 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1351,6 +1351,9 @@ internal_flush(void)
 /* --------------------------------
  *		internal_flush_buffer - flush the given buffer content
  *
+ * Updates start and end to reflect the portion of the buffer that still
+ * needs to be sent.
+ *
  * Returns 0 if OK (meaning everything was sent, or operation would block
  * and the socket is in non-blocking mode), or EOF if trouble.
  * --------------------------------
-- 
2.34.1

#53Ranier Vilela
ranier.vf@gmail.com
In reply to: Jelte Fennema-Nio (#52)
Re: Flushing large data immediately in pqcomm

On Mon, 8 Apr 2024 at 09:27, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:

On Sun, 7 Apr 2024 at 11:34, David Rowley <dgrowleyml@gmail.com> wrote:

That seems to require modifying the following function signatures:
secure_write(), be_tls_write(), be_gssapi_write(). That's not an area
I'm familiar with, however.

Attached is a new patchset where 0003 does exactly that. The only
place where we need to cast to non-const is for GSS, but that seems
fine (commit message has more details).

+1.
Looks ok to me.
The GSS pointer *ptr* is already cast to char * where it is needed,
so the code is already correct.

best regards,
Ranier Vilela