MPTCP - multiplexing many TCP connections through one socket to get better bandwidth

Started by Jakub Wartak · 4 months ago · 1 message
#1 Jakub Wartak
jakub.wartak@enterprisedb.com
1 attachment(s)

Hi -hackers,

With the attached patch, PostgreSQL could gain built-in MPTCP
support, allowing multiple kernel-managed TCP streams to be
multiplexed (aggregated) into a single MPTCP socket. This lets libpq
transparently bypass any "chokepoints" on the network, especially
wherever *multiple* TCP streams can achieve higher aggregate
bandwidth than a single one. One can think of transparent aggregation
of bandwidth over multiple WAN links/tunnels and so on. In short, it
works like this:
libpq_client <--MPTCP--> client_kernel <==multiple TCP
connections==> server_kernel <--MPTCP--> postgres_server
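
To show how little is needed at the socket level: creating an MPTCP
socket on Linux differs from plain TCP only in the third argument to
socket(). Here is a minimal, hedged client sketch (not part of the
patch; the address/port are the demo values used later in this mail,
and the IPPROTO_MPTCP fallback define assumes recent Linux headers):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262	/* value used by recent Linux kernels */
#endif

int
main(void)
{
	struct sockaddr_in addr;
	int			fd;

	/* The only MPTCP-specific line: pass IPPROTO_MPTCP instead of 0. */
	fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
	if (fd < 0)
	{
		/* EPROTONOSUPPORT means the kernel lacks MPTCP; retry with 0. */
		perror("socket(IPPROTO_MPTCP)");
		return 1;
	}

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(5432);
	inet_pton(AF_INET, "10.0.1.240", &addr.sin_addr);

	/* From here on it behaves like any ordinary TCP socket. */
	if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0)
	{
		perror("connect");
		close(fd);
		return 1;
	}
	printf("connected (MPTCP requested)\n");
	close(fd);
	return 0;
}

If the peer does not speak MPTCP, the connection transparently falls
back to plain TCP, which is what makes flipping the protocol argument
safe.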

Without much rework of PostgreSQL, this means accelerating any
libpq-based use case. The most obvious beneficiaries would be
libpq-based heavy network transfers, especially in enterprise
networks. These come to mind:
- pg_basebackup (over e.g. WAN or multiple interfaces; but one can
also think of using 2x 10GigE over LAN)
- streaming replication or logical replication [years ago colleagues
and I were able to use MPTCP in production to bypass the single-TCP-
stream limitation of streaming replication]
- COPY (both upload and download)
- postgres_fdw/dblink?

MPTCP is an IETF standard, has been included in Linux kernels for
some time (realistically 5.16+?), and is *enabled* by default in most
modern distributions. One could use it today via mptcpize (an
LD_PRELOAD wrapper that hijacks socket(); see the sketch below), but
that is not elegant and would require altering systemd startup
scripts (the same story as with NUMA: practically nobody hacks those
just to add numactl --interleave or to adjust ulimits).
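
For reference, a minimal sketch of the technique such a wrapper uses
(an illustration only, not mptcpize's actual source): an LD_PRELOAD
library interposes socket() and upgrades plain TCP requests to MPTCP,
falling back to ordinary TCP on kernels without MPTCP support.

/* mptcp_preload.c
 * build: gcc -shared -fPIC -o mptcp_preload.so mptcp_preload.c -ldl
 * use:   LD_PRELOAD=./mptcp_preload.so psql ...
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262
#endif

int
socket(int domain, int type, int protocol)
{
	static int	(*real_socket) (int, int, int) = NULL;

	/* Look up the real socket() the first time we are called. */
	if (real_socket == NULL)
		real_socket = (int (*) (int, int, int)) dlsym(RTLD_NEXT, "socket");

	/* Upgrade plain stream TCP (protocol 0 or IPPROTO_TCP) to MPTCP. */
	if ((domain == AF_INET || domain == AF_INET6) &&
		(type & ~(SOCK_NONBLOCK | SOCK_CLOEXEC)) == SOCK_STREAM &&
		(protocol == 0 || protocol == IPPROTO_TCP))
	{
		int			fd = real_socket(domain, type, IPPROTO_MPTCP);

		if (fd >= 0)
			return fd;
		/* else: kernel without MPTCP, fall through to plain TCP */
	}
	return real_socket(domain, type, protocol);
}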

The patch right now just assumes IPPROTO_MPTCP is there, so it is not
portable, but not that many OSes support it at all -- I think an
#ifdef would be good enough for now. I don't have access to macOS to
develop this further there, nor do I think it would add much benefit
there, but I may be wrong. As such, the proposed patch is trivial and
Linux-only, although the protocol itself is an open standard,
RFC 8684 [1][2]. I suspect it is far easier and simpler to support
MPTCP once at the socket level than to try to solve the same problem
separately for each of the listed use cases.

Simulation, basic use, and tests:

1. Strictly for demo purposes, we need to ARTIFICIALLY limit outbound
bandwidth for each new flow (TCP connection) to 10 Mbit/s using `tc`
on the server where PostgreSQL is going to be running later on (this
simulates chokepoints / multiple WAN paths):
DEV=enp0s31f6
tc qdisc add dev $DEV root handle 1: htb
tc class add dev $DEV parent 1: classid 1:1 htb rate 100mbit
for i in `seq 2 9`; do
tc class add dev $DEV parent 1:1 classid 1:$i htb rate 10mbit ceil 10mbit
done
# see tc-flow(8) for details; classify each flow by its ports into a separate class (1:2..1:9)
tc filter add dev $DEV parent 1: protocol ip prio 1 handle 1 flow hash keys src,dst,proto,proto-src,proto-dst divisor 8 baseclass 1:2

2. From the client, verify that single-stream TCP bandwidth is really limited:
iperf3 -P 1 -R -c <server> # verify you really get a rate-limited single TCP stream instead of full bandwidth
iperf3 -P 8 -R -c <server> # verify you really get more bandwidth than above

3. Check that MPTCP is enabled and configured on both sides:
uname -r # at least 5.10+ according to [4] to get this balancing working, but 6.1+ LTS is highly recommended (I've used 6.14.x)
sysctl net.mptcp.enabled # should be 1 on both sides by default
ip mptcp limits set subflows 8 add_addr_accepted 8 # feel free to set max limits

4. Configure MPTCP endpoints on the server (this registers dedicated
listening ports for MPTCP use, so there's no need for multiple IP
aliases or PBR):
ps uaxw | grep -i mptcpd # check whether the mptcp daemon (path manager) is running; it is NOT required in this case
ip addr ls # let's assume 10.0.1.240 is my main IP on the eno1 device; no need to add new IPs thanks to the trick below
ip mptcp endpoint show # to verify
#ip mptcp endpoint flush # if necessary
# the following registers ports 5202..5205 as LISTENing by the kernel, dedicated to MPTCP subflows
ip mptcp endpoint add 10.0.1.240 dev eno1 port 5202 signal
ip mptcp endpoint add 10.0.1.240 dev eno1 port 5203 signal
ip mptcp endpoint add 10.0.1.240 dev eno1 port 5204 signal
ip mptcp endpoint add 10.0.1.240 dev eno1 port 5205 signal
ip mptcp endpoint show # to verify

5. Configure the client:
ip addr ls # here I got 10.0.1.250
ip mptcp endpoint show
ip mptcp endpoint add 10.0.1.250 dev enp0s31f6 subflow fullmesh # not sure fullmesh is necessary, probably not
ip mptcp limits set add_addr_accepted 8 subflows 8

6. Verify that MPTCP works; rerun the tests with mptcpize, e.g.:
on server: mptcpize run iperf3 -s
on client: mptcpize run -d iperf3 -P 1 -R -c <server> # should get better bandwidth while using just 1 MPTCP connection
on server: run PostgreSQL with listen_mptcp='on'
on server: ss -Mtlnp sport 5432 # mptcp should be displayed
on client: run basebackup/psql/.. (a programmatic check is sketched below)
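
To verify programmatically (rather than via ss) that a connection
really ended up on MPTCP and did not fall back to plain TCP, Linux
exposes getsockopt(SOL_MPTCP, MPTCP_INFO). A small hedged sketch,
assuming <linux/mptcp.h> from recent kernel headers (a check like
this could also feed the pg_stat_mptcp idea mentioned below;
pg_check_mptcp is a hypothetical helper name, just for illustration):

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <linux/mptcp.h>	/* struct mptcp_info, MPTCP_INFO */

#ifndef SOL_MPTCP
#define SOL_MPTCP 284		/* socket option level for MPTCP */
#endif

/*
 * Returns 1 if fd is an active MPTCP connection, 0 if it fell back to
 * plain TCP (the kernel reports EOPNOTSUPP in that case), -1 on other
 * errors.
 */
static int
pg_check_mptcp(int fd)
{
	struct mptcp_info info;
	socklen_t	len = sizeof(info);

	if (getsockopt(fd, SOL_MPTCP, MPTCP_INFO, &info, &len) < 0)
		return (errno == EOPNOTSUPP) ? 0 : -1;

	printf("MPTCP active, subflows: %u\n", (unsigned) info.mptcpi_subflows);
	return 1;
}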

Sample results for an 82 MB table COPY; it's ~3x faster:
$ time PGMPTCP=0 /usr/pgsql19/bin/psql -h 10.0.1.240 -c '\copy pgbench_accounts TO '/dev/null';'
COPY 500000
real 0m42.123s

$ time PGMPTCP=1 /usr/pgsql19/bin/psql -h 10.0.1.240 -c '\copy pgbench_accounts TO '/dev/null';'
enabling MPTCP client
COPY 500000
real 0m14.416s

Sample results for pg_basebackup of a DB created with pgbench -i -s 5
(~1076 MB in total due to WAL):
$ time /usr/pgsql19/bin/pg_basebackup -h 10.0.1.240 -c fast -D /tmp/test -v
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
[..]
pg_basebackup: base backup completed
real 1m26.786s

With PGMPTCP=1 set, it gets ~3x faster:
$ time PGMPTCP=1 /usr/pgsql19/bin/pg_basebackup -h 10.0.1.240 -c fast -D /tmp/test -v
enabling MPTCP client
pg_basebackup: initiating base backup, waiting for checkpoint to complete
[..]
pg_basebackup: starting background WAL receiver
enabling MPTCP client
[..]
pg_basebackup: base backup completed
real 0m30.460s

Because in the above case we advertised 4 server IP address/port
endpoints to the client, we got the bump on a single socket (note:
which HTB class each flow is hashed into is random and depends on the
ports used, so you usually get anywhere from 2x to 4x here). Also, as
pg_basebackup opens two independent application-level connections
(data transfer + WAL streaming), both get multiplexed, each with 4
subflows. If I added more ip mptcp ports on the server side, we could
of course squeeze out even more, but that assumes one has that many
paths. More advanced setups are possible, including separate
policy-based-routing (ip rule) configurations, as is keeping the TCP
connection highly available even across ISP/interface (WiFi?)
outages. It works transparently with SSL/TLS too (tested). Of course,
it won't remove the single-CPU limitation of the tools involved
(that's a completely different problem).

If this sounds interesting, I was thinking about adding to the patch
something like contrib/mptcpinfo (a pg_stat_mptcp view to mimic
pg_stat_ssl). Also, regarding the patch, there are a couple of other
places where a socket() is created (the libpq cancel packet), but I
see no purpose in adding MPTCP there.

It is important to mention that there are two implementations of
MPTCP on Linux, so when someone googles this topic they will find
lots of conflicting information:
1) The earlier one, which required kernel patching and existed up to
<= 5.6, had an "ndiffports" multiplexer built in, which worked mostly
out of the box.
2) The newer one [3], the one already merged into today's kernels, is
a little different and does not come with a built-in ndiffports path
manager. With this newer one, as shown above, some more manual steps
(ip mptcp endpoints) may be required, but the mptcpd daemon that
manages (sub)flows seems to be evolving as usage of the protocol
rises, so I hope that in the future all of those ip mptcp commands
will become optional.

Thoughts?

-Jakub Wartak.

[1]: https://en.wikipedia.org/wiki/Multipath_TCP
[2]: https://www.rfc-editor.org/rfc/rfc8684.html
[3]: https://www.mptcp.dev/
[4]: https://github.com/multipath-tcp/mptcp_net-next/wiki/#changelog

Attachments:

v1-0001-Add-MPTCP-protocol-support-to-server-and-libpq-on.patch (application/octet-stream)
From e9855281fc14ab475362eb32b697a82b18259ff5 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Thu, 4 Sep 2025 11:39:54 +0200
Subject: [PATCH v1] Add MPTCP protocol support to server and libpq on Linux.

This adds a new listen_mptcp configuration option and also exposes a
new environment variable, PGMPTCP, which can be enabled to request a
MultiPath TCP connection.
---
 doc/src/sgml/libpq.sgml                   | 26 +++++++++++++++++++++++
 src/backend/libpq/pqcomm.c                | 20 ++++++++++++++++-
 src/backend/postmaster/postmaster.c       |  3 +++
 src/backend/utils/misc/guc_parameters.dat |  6 ++++++
 src/include/postmaster/postmaster.h       |  1 +
 src/interfaces/libpq/fe-connect.c         | 17 ++++++++++++++-
 src/interfaces/libpq/libpq-int.h          |  1 +
 7 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 5bf59a19855..6e114477f8a 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -2568,6 +2568,22 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
       </listitem>
      </varlistentry>
 
+     <varlistentry id="libpq-connect-mptcp" xreflabel="mptcp">
+      <term><literal>MPTCP</literal><indexterm><primary>MultiPath TCP</primary></indexterm></term>
+      <listitem>
+       <para>
+        Controls whether the client-side MPTCP protocol is used. The default
+        value is 0, meaning off; set it to 1 to turn it on.
+        This parameter is ignored for connections made via a Unix-domain socket.
+       </para>
+
+       <para>
+        The MPTCP protocol is only supported on Linux and allows connection
+        aggregation (multiplexing) over multiple network paths, provided that
+        the remote side also supports MPTCP.
+       </para>
+      </listitem>
+     </varlistentry>
     </variablelist>
    </para>
   </sect2>
@@ -9178,6 +9194,16 @@ myEventProc(PGEventId evtId, void *evtInfo, void *passThrough)
      </para>
     </listitem>
 
+    <listitem>
+     <para>
+      <indexterm>
+       <primary><envar>PGMPTCP</envar></primary>
+      </indexterm>
+      <envar>PGMPTCP</envar> behaves the same as the <xref
+      linkend="libpq-connect-mptcp"/> connection parameter.
+     </para>
+    </listitem>
+
     <listitem>
      <para>
       <indexterm>
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 25f739a6a17..8cddc96b004 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -438,6 +438,15 @@ ListenServerPort(int family, const char *hostName, unsigned short portNumber,
 	int			one = 1;
 #endif
 
+#ifndef IPPROTO_MPTCP
+	if (ListenMPTCP)
+	{
+		ereport(WARNING,
+				(errmsg("setting the MPTCP listening socket is not supported on this platform")));
+		return STATUS_ERROR;
+	}
+#endif
+
 	/* Initialize hint structure */
 	MemSet(&hint, 0, sizeof(hint));
 	hint.ai_family = family;
@@ -487,6 +496,8 @@ ListenServerPort(int family, const char *hostName, unsigned short portNumber,
 
 	for (addr = addrs; addr; addr = addr->ai_next)
 	{
+		int			ipprotocol = 0;
+
 		if (family != AF_UNIX && addr->ai_family == AF_UNIX)
 		{
 			/*
@@ -538,7 +549,14 @@ ListenServerPort(int family, const char *hostName, unsigned short portNumber,
 			addrDesc = addrBuf;
 		}
 
-		if ((fd = socket(addr->ai_family, SOCK_STREAM, 0)) == PGINVALID_SOCKET)
+		/*
+		 * Enable MPTCP only on IPv4 and IPv6 sockets, not for Unix-domain
+		 * sockets
+		 */
+		if (addr->ai_family != AF_UNIX)
+			ipprotocol = ListenMPTCP ? IPPROTO_MPTCP : 0;
+
+		if ((fd = socket(addr->ai_family, SOCK_STREAM, ipprotocol)) == PGINVALID_SOCKET)
 		{
 			ereport(LOG,
 					(errcode_for_socket_access(),
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e1d643b013d..07b388e88b5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -208,6 +208,9 @@ char	   *Unix_socket_directories;
 /* The TCP listen address(es) */
 char	   *ListenAddresses;
 
+/* Whether to use MPTCP */
+bool		ListenMPTCP;
+
 /*
  * SuperuserReservedConnections is the number of backends reserved for
  * superuser use, and ReservedConnections is the number of backends reserved
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index a157cec3c4d..257b288aaee 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -351,6 +351,12 @@
   boot_val => 'DEFAULT_ASSERT_ENABLED',
 },
 
+{ name => 'listen_mptcp', type => 'bool', context => 'PGC_POSTMASTER', group => 'CONN_AUTH_SETTINGS',
+  short_desc => 'Whether to enable MPTCP on the listening socket',
+  variable => 'ListenMPTCP',
+  boot_val => 'false',
+},
+
 { name => 'exit_on_error', type => 'bool', context => 'PGC_USERSET', group => 'ERROR_HANDLING_OPTIONS',
   short_desc => 'Terminate session on any error.',
   variable => 'ExitOnAnyError',
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 92497cd6a0f..ca4cf7ea295 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -60,6 +60,7 @@ extern PGDLLIMPORT int Unix_socket_permissions;
 extern PGDLLIMPORT char *Unix_socket_group;
 extern PGDLLIMPORT char *Unix_socket_directories;
 extern PGDLLIMPORT char *ListenAddresses;
+extern PGDLLIMPORT bool ListenMPTCP;
 extern PGDLLIMPORT bool ClientAuthInProgress;
 extern PGDLLIMPORT int PreAuthDelay;
 extern PGDLLIMPORT int AuthenticationTimeout;
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index a3d12931fff..abe045ec474 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -415,6 +415,10 @@ static const internalPQconninfoOption PQconninfoOptions[] = {
 		"SSL-Key-Log-File", "D", 64,
 	offsetof(struct pg_conn, sslkeylogfile)},
 
+	{"mptcp", "PGMPTCP", "0", NULL,
+		"MPTCP-Protocol", "", 1,
+	offsetof(struct pg_conn, mptcp)},
+
 	/* Terminating entry --- MUST BE LAST */
 	{NULL, NULL, NULL, NULL,
 	NULL, NULL, 0}
@@ -3236,6 +3240,7 @@ keep_going:						/* We will come back to here until there is
 					char		host_addr[NI_MAXHOST];
 					int			sock_type;
 					AddrInfo   *addr_cur;
+					int			ip_protocol = 0;
 
 					/*
 					 * Advance to next possible host, if we've tried all of
@@ -3321,7 +3326,17 @@ keep_going:						/* We will come back to here until there is
 					 */
 					sock_type |= SOCK_NONBLOCK;
 #endif
-					conn->sock = socket(addr_cur->family, sock_type, 0);
+
+					/*
+					 * Enable MPTCP only on IPv4 and IPv6 sockets, not for
+					 * Unix-domain sockets
+					 */
+					if (addr_cur->family != AF_UNIX && conn->mptcp && conn->mptcp[0] == '1')
+					{
+						fprintf(stderr, "enabling MPTCP client\n");
+						ip_protocol = IPPROTO_MPTCP;
+					}
+					conn->sock = socket(addr_cur->family, sock_type, ip_protocol);
 					if (conn->sock == PGINVALID_SOCKET)
 					{
 						int			errorno = SOCK_ERRNO;
diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h
index 02c114f1405..976c6554803 100644
--- a/src/interfaces/libpq/libpq-int.h
+++ b/src/interfaces/libpq/libpq-int.h
@@ -430,6 +430,7 @@ struct pg_conn
 	char	   *scram_client_key;	/* base64-encoded SCRAM client key */
 	char	   *scram_server_key;	/* base64-encoded SCRAM server key */
 	char	   *sslkeylogfile;	/* where should the client write ssl keylogs */
+	char	   *mptcp;			/* use MPTCP ? */
 
 	bool		cancelRequest;	/* true if this connection is used to send a
 								 * cancel request, instead of being a normal
-- 
2.39.5