BUG #17005: Enhancement request: Improve walsender throughput by aggregating multiple messages in one send
The following bug has been logged on the website:
Bug reference: 17005
Logged by: Rony Kurniawan
Email address: rony.kurniawan@oracle.com
PostgreSQL version: 11.7
Operating system: Oracle Linux Server release 7.9
Description:
Hi,
I measured the throughput of reading the logical replication slot and found
that with a smaller row size (512 bytes) the throughput is 50% lower than
with 1024 bytes.
tcpdump shows that ethernet packets sent by the replication server contain
only one message per packet (see tcpdump output below).
Maybe this is the intended design to achieve low latency, but this is not
favorable in applications that require high throughput.
Is it possible for PostgreSQL to enable Nagle's algorithm on the streaming
socket for replication?
Or aggregate the messages manually before sending them in one send()?
Thank you,
Rony
test case:
Client and server are on different machines, or run the server in Docker.
create table public.test (id integer generated always as identity, name
varchar(512));
alter table public.test replica identity full;
select * from pg_create_logical_replication_slot('testslot',
'test_decoding');
insert into public.test (name) values (rpad('a', 512, 'a'));
...
insert into public.test (name) values (rpad('a', 512, 'a'));
I used pgbench to insert millions of records into the test table to measure
the throughput, but one insert is enough to show how the server sends the
messages.
client terminal 1:
$ sudo tcpdump -D
1.enp0s3
2.virbr0
3.docker0
$ sudo tcpdump -i 3 -w psql.pcap "tcp port 5432"
client terminal 2:
$ pg_recvlogical --start --slot=testslot -d postgres -h 172.17.0.2 -U
postgres -f -
client terminal 1:
$ sudo tcpdump -i 3 -w psql.pcap "tcp port 5432"
ctrl-c
37 packets captured
37 packets received by filter
0 packets dropped by kernel
$ tcpdump --number -nn -A -r psql.pcap
...
22 16:38:37.217677 IP 172.17.0.1.56140 > 172.17.0.2.5432:
...START_REPLICATION SLOT "testslot" LOGICAL 0/0.
...
28 16:38:37.218209 IP 172.17.0.2.5432 > 172.17.0.1.56140: ...BEGIN
1888650
...
30 16:38:37.218332 IP 172.17.0.2.5432 > 172.17.0.1.56140: ...table
public.test: INSERT: id[integer]: 1 name[character
varying]:'aaa...512...aaa'
31 16:38:37.218345 IP 172.17.0.2.5432 > 172.17.0.1.56140: ...COMMIT
1888650
Hi,
On 2021-05-13 00:31:53 +0000, PG Bug reporting form wrote:
> I measured the throughput of reading the logical replication slot and found
> that in smaller row size (512 bytes) the throughput is 50% lower compared to
> 1024 bytes.

Huh, that is interesting.

> tcpdump shows that ethernet packets sent by the replication server contain
> only one message per packet (see tcpdump output below).
> May be this is the intended design to achieve low latency but this is not
> favorable in application that requires high throughput.

What kind of network is this? I would have expected that if the network
can't keep up, the small sends would end up getting aggregated into
larger packets anyway. Are you hitting a PPS limit due to the small
packets, but not yet the throughput limit?

> Is it possible for PostgreSQL to enable Nagle's algorithm on the streaming
> socket for replication?
> Or aggregate the messages manually before sending them in one send()?

I think we can probably do better in cases where a transaction is more than a
single change - but I don't think either enabling Nagle's or aggregation
is really an option in the case of single-row transactions. The latency
impact for some scenarios seems too high.
Greetings,
Andres Freund
Andres Freund wrote on 2021-05-17 01:44:
Hi,
> On 2021-05-13 00:31:53 +0000, PG Bug reporting form wrote:
>> I measured the throughput of reading the logical replication slot and found
>> that in smaller row size (512 bytes) the throughput is 50% lower compared to
>> 1024 bytes.
>
> Huh, that is interesting.
>
>> tcpdump shows that ethernet packets sent by the replication server contain
>> only one message per packet (see tcpdump output below).
>> May be this is the intended design to achieve low latency but this is not
>> favorable in application that requires high throughput.
>
> What kind of network is this? I would have expected that if the network
> can't keep up, the small sends would end up getting aggregated into
> larger packets anyway. Are you hitting a PPS limit due to the small
> packets, but not yet the throughput limit?

I believe the reason is more in syscall and kernel CPU time overhead
than in network throughput, especially in this post-Meltdown/Spectre era.

>> Is it possible for PostgreSQL to enable Nagle's algorithm on the streaming
>> socket for replication?
>> Or aggregate the messages manually before sending them in one send()?
>
> I think we can probably do better in cases where a transaction is more than a
> single change - but I don't think either enabling Nagle's or aggregation
> is really an option in the case of single-row transactions. The latency
> impact for some scenarios seems too high.

We have the commit_siblings and commit_delay options. Perhaps something
similar could be introduced for replication slots?
regards
Yura
Hi,
On 2021-05-17 16:21:36 +0300, Yura Sokolov wrote:
> Andres Freund wrote on 2021-05-17 01:44:
>> What kind of network is this? I would have expected that if the network
>> can't keep up, the small sends would end up getting aggregated into
>> larger packets anyway. Are you hitting a PPS limit due to the small
>> packets, but not yet the throughput limit?
>
> I believe the reason is more in syscall and kernel CPU time overhead than
> in network throughput, especially in this post-Meltdown/Spectre era.

Well, enabling Nagle's wouldn't change anything if the issue is just
syscall and not network overhead. Yet the ask was to enable Nagle's...
Greetings,
Andres Freund
Hi,
On 5/17/2021 9:27 AM, Andres Freund wrote:
> Hi,
>
> On 2021-05-17 16:21:36 +0300, Yura Sokolov wrote:
>> Andres Freund wrote on 2021-05-17 01:44:
>>> What kind of network is this? I would have expected that if the network
>>> can't keep up, the small sends would end up getting aggregated into
>>> larger packets anyway. Are you hitting a PPS limit due to the small
>>> packets, but not yet the throughput limit?
>>
>> I believe the reason is more in syscall and kernel CPU time overhead than
>> in network throughput, especially in this post-Meltdown/Spectre era.
>
> Well, enabling Nagle's wouldn't change anything if the issue is just
> syscall and not network overhead. Yet the ask was to enable Nagle's...
>
> Greetings,
> Andres Freund
The networks that I tested were gigabit and docker (local). With
TCP_NODELAY enabled, the only time small sends would be aggregated is by
auto-corking in TCP/IP when there is network congestion. But as you can
see from the tcpdump output, the messages are in individual packets;
therefore there is no aggregation and no network congestion.
There is network overhead on both sender and receiver: TCP/IP headers,
the number of skbs, ethernet tx/rx descriptors, and interrupts. There is
also syscall overhead in pg_recvlogical, where one insert in the example
requires 3 recv() calls to read the BEGIN, INSERT, and COMMIT messages,
instead of one recv() to read all three messages when Nagle's is enabled.
This syscall overhead is the same in the multi-change transaction case,
where each change is one recv().
I agree that in some cases low latency in replication is required, but
there are also cases where high throughput is desired, especially when
the standby server is behind due to an outage, where latency is not a
concern.
I experimented by simply disabling TCP_NODELAY in
walsender.c:StartLogicalReplication() and the throughput went up by 60%.
This is just a proof of concept that some kind of message aggregation
would result in higher throughput.
Thank you,
Rony
Hi,
On 2021-05-17 11:19:31 -0700, Rony Kurniawan wrote:
> The networks that I tested were gigabits and docker (local). With
> TCP_NODELAY enabled, the only time small sends would be aggregated is by
> auto corking in tcp/ip when there is network congestion. But as you can see
> from the tcpdump output the messages are in individual packets, therefore
> there is no aggregation and no network congestion.
I don't understand why "individual packets" implies that there can be
no network congestion? Or are you just saying that in the specific
period traced you didn't observe any?
I just verified this with iperf - I see large packets with
iperf -l 500 --nodelay -c $other_host
but not
iperf -b 10M -l 500 --nodelay -c $other_host
I had to remember how to disable TCP segmentation offloading to see
proper packet sizes in the first case; without that, there were a lot of
65226-byte packets in the first case...
> There is network overhead in both sender and receiver like tcp/ip header,
> number of skb, ethernet tx/rx descriptors, and interrupts.

Right.
> Also syscall overhead in pg_recvlogical where for one insert in the
> example requires 3 recv() calls to read BEGIN, INSERT, COMMIT messages
> instead of one recv() to read all three messages when Nagle's is
> enabled. This syscall overhead is the same in transaction case with
> multiple changes where each change is one recv().
I think the obvious and unproblematic improvement is to only send data
to the socket if WalSndWriteData's last_write parameter is set, or if
there's a certain amount of data in the socket. That'll only get rid of
some of the overhead, since we'd still send things like transactions
separately.
Another improvement might be that WalSndWriteData() possibly shouldn't
block even if pq_is_send_pending() and the pending amount isn't huge,
iff !last_write. That way we'd end up doing fewer syscalls, each sending
more data at once.
Greetings,
Andres Freund
On 5/17/2021 11:54 AM, Andres Freund wrote:
Hi,
> On 2021-05-17 11:19:31 -0700, Rony Kurniawan wrote:
>> The networks that I tested were gigabits and docker (local). With
>> TCP_NODELAY enabled, the only time small sends would be aggregated is by
>> auto corking in tcp/ip when there is network congestion. But as you can see
>> from the tcpdump output the messages are in individual packets, therefore
>> there is no aggregation and no network congestion.
>
> I don't understand why "individual packets" implies that there can be
> no network congestion? Or are you just saying that in the specific
> period traced you didn't observe that?
Since PostgreSQL enables TCP_NODELAY, it is up to the kernel to
aggregate those sends. In the case of auto-corking, that happens when
the NIC has outstanding packets in the tx queue due to network
congestion, or when the NIC cannot keep up with the volume of send()
calls from the application.
On a gigabit ethernet, the amount of data produced by the logical
replication server is not enough to trigger auto-corking or other
aggregation, hence the individual packet per message. Although
aggregation could still happen sometimes.
In my bigger test case, using pgbench to insert 20 records per
transaction for 1 minute, I see some bigger packets, but they are mostly
629 bytes.
> I just verified this with iperf - I see large packets with
>   iperf -l 500 --nodelay -c $other_host
> but not
>   iperf -b 10M -l 500 --nodelay -c $other_host
>
> I had to remember how to disable TCP segmentation offloading to see
> proper packet sizes in the first case; without that, there were a lot of
> 65226-byte packets in the first case...
>
>> There is network overhead in both sender and receiver like tcp/ip header,
>> number of skb, ethernet tx/rx descriptors, and interrupts.
>
> Right.
>
>> Also syscall overhead in pg_recvlogical where for one insert in the
>> example requires 3 recv() calls to read BEGIN, INSERT, COMMIT messages
>> instead of one recv() to read all three messages when Nagle's is
>> enabled. This syscall overhead is the same in transaction case with
>> multiple changes where each change is one recv().
>
> I think the obvious and unproblematic improvement is to only send data
> to the socket if WalSndWriteData's last_write parameter is set, or if
> there's a certain amount of data in the socket. That'll only get rid of
> some of the overhead, since we'd still send things like transactions
> separately.
>
> Another improvement might be that WalSndWriteData() possibly shouldn't
> block even if pq_is_send_pending() and the pending amount isn't huge,
> iff !last_write. That way we'd end up doing syscalls sending more data
> at once.
Thank you for looking into this,
Rony
On Mon, May 17, 2021 at 4:14 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2021-05-13 00:31:53 +0000, PG Bug reporting form wrote:
>> I measured the throughput of reading the logical replication slot and found
>> that in smaller row size (512 bytes) the throughput is 50% lower compared to
>> 1024 bytes.
>
> Huh, that is interesting.
>
>> tcpdump shows that ethernet packets sent by the replication server contain
>> only one message per packet (see tcpdump output below).
>> May be this is the intended design to achieve low latency but this is not
>> favorable in application that requires high throughput.
>
> What kind of network is this? I would have expected that if the network
> can't keep up, the small sends would end up getting aggregated into
> larger packets anyway. Are you hitting a PPS limit due to the small
> packets, but not yet the throughput limit?
>
>> Is it possible for PostgreSQL to enable Nagle's algorithm on the streaming
>> socket for replication?
>> Or aggregate the messages manually before sending them in one send()?
>
> I think we can probably do better in cases where a transaction is more than a
> single change - but I don't think either enabling Nagle's or aggregation
> is really an option in the case of single-row transactions. The latency
> impact for some scenarios seems too high.
Can we think of combining Begin, single_change, and Commit, and sending
them as one message for single-row change xacts?
--
With Regards,
Amit Kapila.