GUC-ify walsender MAX_SEND_SIZE constant
Hello all,
Following the threads here
</messages/by-id/CAFWczPvi_5FWH+JTqkWbi+w83hy=MYg=2hKK0=JZBe9=hTpE4w@mail.gmail.com>
and there <https://commitfest.postgresql.org/13/958/>, I decided to submit
this patch.
Following is the description which is also written in the commit message:
The MAX_SEND_SIZE parameter is used in the WAL sender to limit the maximum size
of a WAL data packet sent to a WAL receiver during replication. Although
its default value (128kB) was reasonable, it was chosen in
2010. Since hardware has changed a lot since then, a PostgreSQL
user might need to customize this value.
For example, if a database's disk has high bandwidth and high
latency at the same time, it makes more sense to read bigger chunks of
data from disk in each iteration. One example of such a disk is a remote
disk (e.g. an RBD volume).
However, this value does not need to be larger than wal_segment_size,
thus its checker function returns false if a larger value is set for
this.
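For illustration, since the patch below makes this a SIGHUP-level setting, it
could be adjusted without a restart; a minimal sketch, with an arbitrary
example value:

  psql -c "ALTER SYSTEM SET wal_sender_max_send_size = '1MB';"   # example value only
  psql -c "SELECT pg_reload_conf();"                              # no restart needed
  psql -c "SHOW wal_sender_max_send_size;"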
This is my first patch. So, I hope I haven't done something wrong. :'D
Best regards
Majid
Attachments:
v1-0001-GUC-ify-MAX_SEND_SIZE-constant-in-WALSender.patch (text/x-patch; charset=US-ASCII)
From 218a925d1161878c888bef8adb3f246ec2b44d12 Mon Sep 17 00:00:00 2001
From: Majid Garoosi <amoomajid99@gmail.com>
Date: Fri, 19 Jan 2024 22:48:23 +0330
Subject: [PATCH v1] GUC-ify MAX_SEND_SIZE constant in WALSender
The MAX_SEND_SIZE parameter is used in the WAL sender to limit the maximum size
of a WAL data packet sent to a WAL receiver during replication. Although
its default value (128kB) was reasonable, it was chosen in
2010. Since hardware has changed a lot since then, a PostgreSQL
user might need to customize this value.
For example, if a database's disk has high bandwidth and high
latency at the same time, it makes more sense to read bigger chunks of
data from disk in each iteration. One example of such a disk is a remote
disk (e.g. an RBD volume).
However, this value does not need to be larger than wal_segment_size,
thus its checker function returns false if a larger value is set for
this.
---
src/backend/replication/walsender.c | 18 +++++-------------
src/backend/utils/init/postinit.c | 11 +++++++++++
src/backend/utils/misc/guc_tables.c | 13 +++++++++++++
src/include/replication/walsender.h | 1 +
src/include/utils/guc_hooks.h | 1 +
5 files changed, 31 insertions(+), 13 deletions(-)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 087031e9dc..0299859d97 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -96,17 +96,6 @@
#include "utils/timeout.h"
#include "utils/timestamp.h"
-/*
- * Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ.
- *
- * We don't have a good idea of what a good value would be; there's some
- * overhead per message in both walsender and walreceiver, but on the other
- * hand sending large batches makes walsender less responsive to signals
- * because signals are checked only between messages. 128kB (with
- * default 8k blocks) seems like a reasonable guess for now.
- */
-#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
-
/* Array of WalSnds in shared memory */
WalSndCtlData *WalSndCtl = NULL;
@@ -124,6 +113,9 @@ int max_wal_senders = 10; /* the maximum number of concurrent
* walsenders */
int wal_sender_timeout = 60 * 1000; /* maximum time to send one WAL
* data message */
+int wal_sender_max_send_size = XLOG_BLCKSZ * 16; /* Maximum data
+ * payload in a WAL
+ * data message */
bool log_replication_commands = false;
/*
@@ -3087,7 +3079,7 @@ XLogSendPhysical(void)
*/
startptr = sentPtr;
endptr = startptr;
- endptr += MAX_SEND_SIZE;
+ endptr += wal_sender_max_send_size;
/* if we went beyond SendRqstPtr, back off */
if (SendRqstPtr <= endptr)
@@ -3106,7 +3098,7 @@ XLogSendPhysical(void)
}
nbytes = endptr - startptr;
- Assert(nbytes <= MAX_SEND_SIZE);
+ Assert(nbytes <= wal_sender_max_send_size);
/*
* OK to read and send the slice.
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 1ad3367159..8d61b006c4 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -617,6 +617,17 @@ check_max_wal_senders(int *newval, void **extra, GucSource source)
return true;
}
+/*
+ * GUC check_hook for wal_sender_max_send_size
+ */
+bool
+check_wal_sender_max_send_size(int *newval, void **extra, GucSource source)
+{
+ if (*newval > wal_segment_size)
+ return false;
+ return true;
+}
+
/*
* Early initialization of a backend (either standalone or under postmaster).
* This happens even before InitPostgres.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e53ebc6dc2..76f755cbd3 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2892,6 +2892,19 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"wal_sender_max_send_size", PGC_SIGHUP, REPLICATION_SENDING,
+ gettext_noop("Sets size of the maximum data payload in a WAL data message."),
+ gettext_noop("Walsender procedure consists of a loop, reading wal_sender_max_send_size "
+ "bytes of WALs from disk and sending them to the receiver. Sending large "
+ "batches makes walsender less responsive to signals."),
+ GUC_UNIT_BYTE,
+ },
+ &wal_sender_max_send_size,
+ 16 * XLOG_BLCKSZ, XLOG_BLCKSZ, INT_MAX,
+ check_wal_sender_max_send_size, NULL, NULL
+ },
+
{
{"commit_delay", PGC_SUSET, WAL_SETTINGS,
gettext_noop("Sets the delay in microseconds between transaction commit and "
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 1b58d50b3b..f4d3c73f4d 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -31,6 +31,7 @@ extern PGDLLIMPORT bool wake_wal_senders;
/* user-settable parameters */
extern PGDLLIMPORT int max_wal_senders;
extern PGDLLIMPORT int wal_sender_timeout;
+extern PGDLLIMPORT int wal_sender_max_send_size;
extern PGDLLIMPORT bool log_replication_commands;
extern void InitWalSender(void);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 5300c44f3b..b0ac07171c 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -84,6 +84,7 @@ extern bool check_maintenance_io_concurrency(int *newval, void **extra,
extern void assign_maintenance_io_concurrency(int newval, void *extra);
extern bool check_max_connections(int *newval, void **extra, GucSource source);
extern bool check_max_wal_senders(int *newval, void **extra, GucSource source);
+extern bool check_wal_sender_max_send_size(int *newval, void **extra, GucSource source);
extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
GucSource source);
extern void assign_max_wal_size(int newval, void *extra);
--
2.34.1
On Fri, Jan 19, 2024 at 11:04:50PM +0330, Majid Garoosi wrote:
However, this value does not need to be larger than wal_segment_size,
thus its checker function returns false if a larger value is set for
this.

This is my first patch. So, I hope I haven't done something wrong. :'D
You've done nothing wrong. Thanks for the patch!
+ if (*newval > wal_segment_size)
+ return false;
+ return true;
I was not sure at first that we need a dynamic check, but I can see why
somebody may want to make it higher than 1MB these days.
The patch is missing a couple of things:
- Documentation in doc/src/sgml/config.sgml, that has a section for
"Sending Servers".
- It is not listed in postgresql.conf.sample. I would suggest to put
it in REPLICATION -> Sending Servers.
The default value of 128kB should be mentioned in both cases.
- * We don't have a good idea of what a good value would be; there's some
- * overhead per message in both walsender and walreceiver, but on the other
- * hand sending large batches makes walsender less responsive to signals
- * because signals are checked only between messages. 128kB (with
- * default 8k blocks) seems like a reasonable guess for now.
[...]
+ gettext_noop("Walsender procedure consists of a loop, reading wal_sender_max_send_size "
+ "bytes of WALs from disk and sending them to the receiver. Sending large "
+ "batches makes walsender less responsive to signals."),
This is removing some information about why it may be a bad idea to
use a too low value (message overhead) and why it may be a bad idea to
use a too large value (responsiveness). I would suggest to remove the
second gettext_noop() in guc_tables.c and move all this information to
config.sgml with the description of the new GUC. Perhaps just put it
after wal_sender_timeout in the sample file and the docs?
Three comments in walsender.c still mention MAX_SEND_SIZE. These
should be switched to mention the GUC instead.
I am switching the patch as waiting on author for now.
--
Michael
Thank you very much for your review.
I generally agree with your suggestions, so just applied them.
You can find the new patch in the attached file.
Best
Majid
Attachments:
v2-0001-Add-documentation-for-wal_sender_max_send_size.patch (text/x-patch; charset=US-ASCII)
From fdd833b8b5cfdc1645f8288b27e9a92fc46f7951 Mon Sep 17 00:00:00 2001
From: Majid Garoosi <amoomajid99@gmail.com>
Date: Fri, 19 Jan 2024 22:48:23 +0330
Subject: [PATCH v2] Add documentation for wal_sender_max_send_size
Detailed changes:
- Add documentation to postgresql's config docs
- Remove code comments mentioning the setting's old constant name
- Add the parameter to postgresql.conf.sample
---
doc/src/sgml/config.sgml | 26 ++++++++++++++++++
src/backend/replication/walsender.c | 27 +++++++------------
src/backend/utils/init/postinit.c | 11 ++++++++
src/backend/utils/misc/guc_tables.c | 11 ++++++++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/replication/walsender.h | 1 +
src/include/utils/guc_hooks.h | 1 +
7 files changed, 61 insertions(+), 17 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 61038472c5..86b9488cd1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4405,6 +4405,32 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</listitem>
</varlistentry>
+ <varlistentry id="guc-wal-sender-max-send-size" xreflabel="wal_sender_max_send_size">
+ <term><varname>wal_sender_max_send_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_sender_max_send_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum size of chunks that wal sender reads from disk and sends to the
+ standby servers in each iteration. Its default value is <literal>16</literal>
+ times <literal>XLOG_BLCKSZ</literal>, i.e. it is <literal>128kB</literal> if
+ the project is built with the default <literal>XLOG_BLCKSZ</literal>
+ configuration parameter. This parameter can be changed without restarting the
+ server.
+ </para>
+ <para>
+ The minimum allowed value is <literal>XLOG_BLCKSZ</literal> bytes, typically
+ <literal>8kB</literal>. Its maximum allowed value is
+ <varname>wal_segment_size</varname>, typically <literal>16MB</literal>.
+ While a large value for this parameter reduces the responsiveness of WAL
+ sender processes to the signals, setting small values increases the overhead
+ of the messages both in WAL senders and WAL receivers.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
<term><varname>track_commit_timestamp</varname> (<type>boolean</type>)
<indexterm>
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 087031e9dc..c429facded 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -96,17 +96,6 @@
#include "utils/timeout.h"
#include "utils/timestamp.h"
-/*
- * Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ.
- *
- * We don't have a good idea of what a good value would be; there's some
- * overhead per message in both walsender and walreceiver, but on the other
- * hand sending large batches makes walsender less responsive to signals
- * because signals are checked only between messages. 128kB (with
- * default 8k blocks) seems like a reasonable guess for now.
- */
-#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
-
/* Array of WalSnds in shared memory */
WalSndCtlData *WalSndCtl = NULL;
@@ -124,6 +113,9 @@ int max_wal_senders = 10; /* the maximum number of concurrent
* walsenders */
int wal_sender_timeout = 60 * 1000; /* maximum time to send one WAL
* data message */
+int wal_sender_max_send_size = XLOG_BLCKSZ * 16; /* Maximum data
+ * payload in a WAL
+ * data message */
bool log_replication_commands = false;
/*
@@ -2894,8 +2886,8 @@ WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
/*
* Send out the WAL in its normal physical/stored form.
*
- * Read up to MAX_SEND_SIZE bytes of WAL that's been flushed to disk,
- * but not yet sent to the client, and buffer it in the libpq output
+ * Read up to wal_sender_max_send_size bytes of WAL that's been flushed to
+ * disk, but not yet sent to the client, and buffer it in the libpq output
* buffer.
*
* If there is no unsent WAL remaining, WalSndCaughtUp is set to true,
@@ -3076,8 +3068,9 @@ XLogSendPhysical(void)
/*
* Figure out how much to send in one message. If there's no more than
- * MAX_SEND_SIZE bytes to send, send everything. Otherwise send
- * MAX_SEND_SIZE bytes, but round back to logfile or page boundary.
+ * wal_sender_max_send_size bytes to send, send everything. Otherwise send
+ * wal_sender_max_send_size bytes, but round back to logfile or page
+ * boundary.
*
* The rounding is not only for performance reasons. Walreceiver relies on
* the fact that we never split a WAL record across two messages. Since a
@@ -3087,7 +3080,7 @@ XLogSendPhysical(void)
*/
startptr = sentPtr;
endptr = startptr;
- endptr += MAX_SEND_SIZE;
+ endptr += wal_sender_max_send_size;
/* if we went beyond SendRqstPtr, back off */
if (SendRqstPtr <= endptr)
@@ -3106,7 +3099,7 @@ XLogSendPhysical(void)
}
nbytes = endptr - startptr;
- Assert(nbytes <= MAX_SEND_SIZE);
+ Assert(nbytes <= wal_sender_max_send_size);
/*
* OK to read and send the slice.
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 1ad3367159..8d61b006c4 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -617,6 +617,17 @@ check_max_wal_senders(int *newval, void **extra, GucSource source)
return true;
}
+/*
+ * GUC check_hook for wal_sender_max_send_size
+ */
+bool
+check_wal_sender_max_send_size(int *newval, void **extra, GucSource source)
+{
+ if (*newval > wal_segment_size)
+ return false;
+ return true;
+}
+
/*
* Early initialization of a backend (either standalone or under postmaster).
* This happens even before InitPostgres.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e53ebc6dc2..0ecd8b8940 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2892,6 +2892,17 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"wal_sender_max_send_size", PGC_SIGHUP, REPLICATION_SENDING,
+ gettext_noop("Sets size of the maximum data payload in a WAL data message."),
+ NULL,
+ GUC_UNIT_BYTE,
+ },
+ &wal_sender_max_send_size,
+ 16 * XLOG_BLCKSZ, XLOG_BLCKSZ, INT_MAX,
+ check_wal_sender_max_send_size, NULL, NULL
+ },
+
{
{"commit_delay", PGC_SUSET, WAL_SETTINGS,
gettext_noop("Sets the delay in microseconds between transaction commit and "
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 835b0e9ba8..bc393245e1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -323,6 +323,7 @@
#wal_keep_size = 0 # in megabytes; 0 disables
#max_slot_wal_keep_size = -1 # in megabytes; -1 disables
#wal_sender_timeout = 60s # in milliseconds; 0 disables
+#wal_sender_max_send_size = 128kB # in bytes
#track_commit_timestamp = off # collect timestamp of transaction commit
# (change requires restart)
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 1b58d50b3b..f4d3c73f4d 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -31,6 +31,7 @@ extern PGDLLIMPORT bool wake_wal_senders;
/* user-settable parameters */
extern PGDLLIMPORT int max_wal_senders;
extern PGDLLIMPORT int wal_sender_timeout;
+extern PGDLLIMPORT int wal_sender_max_send_size;
extern PGDLLIMPORT bool log_replication_commands;
extern void InitWalSender(void);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 5300c44f3b..b0ac07171c 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -84,6 +84,7 @@ extern bool check_maintenance_io_concurrency(int *newval, void **extra,
extern void assign_maintenance_io_concurrency(int newval, void *extra);
extern bool check_max_connections(int *newval, void **extra, GucSource source);
extern bool check_max_wal_senders(int *newval, void **extra, GucSource source);
+extern bool check_wal_sender_max_send_size(int *newval, void **extra, GucSource source);
extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
GucSource source);
extern void assign_max_wal_size(int newval, void *extra);
--
2.34.1
On Thu, Feb 08, 2024 at 02:42:00PM +0330, Majid Garoosi wrote:
Thank you very much for your review.
Something to be aware of, but the community lists use bottom-posting
for replies because it is easier to follow the logic of a thread this
way. See here:
https://en.wikipedia.org/wiki/Posting_style#Bottom-posting
I generally agree with your suggestions, so just applied them.
You can find the new patch in the attached file.
Thanks for the patch, that looks rather fine. I have spent some time
polishing the docs, adding a mention that increasing the value can
show benefits depending on what you do. How does the attached look to
you?
--
Michael
Attachments:
v3-0001-Introduce-wal_sender_max_send_size-as-a-GUC.patch (text/x-diff; charset=us-ascii)
From 2bacd4e4aaff53d26d8017e905e1dd02a67e93ff Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 9 Feb 2024 14:18:31 +0900
Subject: [PATCH v3] Introduce wal_sender_max_send_size as a GUC
Blah, blah.
Author: Majid Garoosi
Discussion: https://postgr.es/m/
---
src/include/replication/walsender.h | 1 +
src/include/utils/guc_hooks.h | 1 +
src/backend/replication/walsender.c | 26 +++++++------------
src/backend/utils/init/postinit.c | 11 ++++++++
src/backend/utils/misc/guc_tables.c | 11 ++++++++
src/backend/utils/misc/postgresql.conf.sample | 1 +
doc/src/sgml/config.sgml | 23 ++++++++++++++++
7 files changed, 57 insertions(+), 17 deletions(-)
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 1b58d50b3b..f4d3c73f4d 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -31,6 +31,7 @@ extern PGDLLIMPORT bool wake_wal_senders;
/* user-settable parameters */
extern PGDLLIMPORT int max_wal_senders;
extern PGDLLIMPORT int wal_sender_timeout;
+extern PGDLLIMPORT int wal_sender_max_send_size;
extern PGDLLIMPORT bool log_replication_commands;
extern void InitWalSender(void);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 5300c44f3b..b0ac07171c 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -84,6 +84,7 @@ extern bool check_maintenance_io_concurrency(int *newval, void **extra,
extern void assign_maintenance_io_concurrency(int newval, void *extra);
extern bool check_max_connections(int *newval, void **extra, GucSource source);
extern bool check_max_wal_senders(int *newval, void **extra, GucSource source);
+extern bool check_wal_sender_max_send_size(int *newval, void **extra, GucSource source);
extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
GucSource source);
extern void assign_max_wal_size(int newval, void *extra);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 77c8baa32a..a55516f589 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -96,17 +96,6 @@
#include "utils/timeout.h"
#include "utils/timestamp.h"
-/*
- * Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ.
- *
- * We don't have a good idea of what a good value would be; there's some
- * overhead per message in both walsender and walreceiver, but on the other
- * hand sending large batches makes walsender less responsive to signals
- * because signals are checked only between messages. 128kB (with
- * default 8k blocks) seems like a reasonable guess for now.
- */
-#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
-
/* Array of WalSnds in shared memory */
WalSndCtlData *WalSndCtl = NULL;
@@ -124,6 +113,8 @@ int max_wal_senders = 10; /* the maximum number of concurrent
* walsenders */
int wal_sender_timeout = 60 * 1000; /* maximum time to send one WAL
* data message */
+int wal_sender_max_send_size = XLOG_BLCKSZ * 16; /* Maximum data payload
+ * in a WAL data message */
bool log_replication_commands = false;
/*
@@ -2950,8 +2941,8 @@ WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
/*
* Send out the WAL in its normal physical/stored form.
*
- * Read up to MAX_SEND_SIZE bytes of WAL that's been flushed to disk,
- * but not yet sent to the client, and buffer it in the libpq output
+ * Read up to wal_sender_max_send_size bytes of WAL that's been flushed to
+ * disk, but not yet sent to the client, and buffer it in the libpq output
* buffer.
*
* If there is no unsent WAL remaining, WalSndCaughtUp is set to true,
@@ -3132,8 +3123,9 @@ XLogSendPhysical(void)
/*
* Figure out how much to send in one message. If there's no more than
- * MAX_SEND_SIZE bytes to send, send everything. Otherwise send
- * MAX_SEND_SIZE bytes, but round back to logfile or page boundary.
+ * wal_sender_max_send_size bytes to send, send everything. Otherwise send
+ * wal_sender_max_send_size bytes, but round back to logfile or page
+ * boundary.
*
* The rounding is not only for performance reasons. Walreceiver relies on
* the fact that we never split a WAL record across two messages. Since a
@@ -3143,7 +3135,7 @@ XLogSendPhysical(void)
*/
startptr = sentPtr;
endptr = startptr;
- endptr += MAX_SEND_SIZE;
+ endptr += wal_sender_max_send_size;
/* if we went beyond SendRqstPtr, back off */
if (SendRqstPtr <= endptr)
@@ -3162,7 +3154,7 @@ XLogSendPhysical(void)
}
nbytes = endptr - startptr;
- Assert(nbytes <= MAX_SEND_SIZE);
+ Assert(nbytes <= wal_sender_max_send_size);
/*
* OK to read and send the slice.
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 1ad3367159..8d61b006c4 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -617,6 +617,17 @@ check_max_wal_senders(int *newval, void **extra, GucSource source)
return true;
}
+/*
+ * GUC check_hook for wal_sender_max_send_size
+ */
+bool
+check_wal_sender_max_send_size(int *newval, void **extra, GucSource source)
+{
+ if (*newval > wal_segment_size)
+ return false;
+ return true;
+}
+
/*
* Early initialization of a backend (either standalone or under postmaster).
* This happens even before InitPostgres.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7fe58518d7..eff414fd18 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2902,6 +2902,17 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"wal_sender_max_send_size", PGC_SIGHUP, REPLICATION_SENDING,
+ gettext_noop("Sets size of the maximum data payload in a WAL data message."),
+ NULL,
+ GUC_UNIT_BYTE,
+ },
+ &wal_sender_max_send_size,
+ 16 * XLOG_BLCKSZ, XLOG_BLCKSZ, INT_MAX,
+ check_wal_sender_max_send_size, NULL, NULL
+ },
+
{
{"commit_delay", PGC_SUSET, WAL_SETTINGS,
gettext_noop("Sets the delay in microseconds between transaction commit and "
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index da10b43dac..cb430d6e48 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -323,6 +323,7 @@
#wal_keep_size = 0 # in megabytes; 0 disables
#max_slot_wal_keep_size = -1 # in megabytes; -1 disables
#wal_sender_timeout = 60s # in milliseconds; 0 disables
+#wal_sender_max_send_size = 128kB # min 8kB
#track_commit_timestamp = off # collect timestamp of transaction commit
# (change requires restart)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 61038472c5..4f4285aed8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4405,6 +4405,29 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</listitem>
</varlistentry>
+ <varlistentry id="guc-wal-sender-max-send-size" xreflabel="wal_sender_max_send_size">
+ <term><varname>wal_sender_max_send_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_sender_max_send_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Set the maximum size of a message sent by a WAL sender process during
+ physical replication. The default value is 16 times the WAL block size,
+ i.e. <literal>128kB</literal>. This parameter can be changed without
+ restarting the server.
+ </para>
+ <para>
+ Increasing this value can show performance benefits in some
+ environments. A larger value reduces the responsiveness of WAL sender
+ processes to signals as these are checked only between messages.
+ Setting smaller values increases the overhead of the messages both
+ in WAL senders and WAL receivers.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
<term><varname>track_commit_timestamp</varname> (<type>boolean</type>)
<indexterm>
--
2.43.0
On Fri, 9 Feb 2024 at 08:50, Michael Paquier <michael@paquier.xyz> wrote:
Something to be aware of, but the community lists use bottom-posting
for replies because it is easier to follow the logic of a thread this
way. See here:
https://en.wikipedia.org/wiki/Posting_style#Bottom-posting
Oh, sorry for not using the convention here. I just noticed that after
pressing the send button. =)
Thanks for the patch, that looks rather fine. I have spent some time
polishing the docs, adding a mention that increasing the value can
show benefits depending on what you do. How does the attached look to
you?
I took a look at it and it seems good to me.
Maybe I should work on my writing skills more. :D
Best
Majid
Hi,
On 2024-01-19 23:04:50 +0330, Majid Garoosi wrote:
Following is the description which is also written in the commit message:
The MAX_SEND_SIZE parameter is used in the WAL sender to limit the maximum size
of a WAL data packet sent to a WAL receiver during replication. Although
its default value (128kB) was reasonable, it was chosen in
2010. Since hardware has changed a lot since then, a PostgreSQL
user might need to customize this value.
For example, if a database's disk has high bandwidth and high
latency at the same time, it makes more sense to read bigger chunks of
data from disk in each iteration. One example of such a disk is a remote
disk (e.g. an RBD volume).
The way we read the WAL data is perfectly prefetchable by the OS, so I
wouldn't really expect gains here. Have you actually been able to see a
performance benefit by increasing MAX_SEND_SIZE?
I don't think we should add configuration parameters without at least some
demonstration of practical gains, otherwise we'll end up with hundreds of
never-useful config options.
Greetings,
Andres Freund
Hi Andres,
On Fri, 9 Feb 2024 at 22:33, Andres Freund <andres@anarazel.de> wrote:
On 2024-01-19 23:04:50 +0330, Majid Garoosi wrote:
Following is the description which is also written in the commit message:
The MAX_SEND_SIZE parameter is used in the WAL sender to limit the maximum size
of a WAL data packet sent to a WAL receiver during replication. Although
its default value (128kB) was reasonable, it was chosen in
2010. Since hardware has changed a lot since then, a PostgreSQL
user might need to customize this value.
For example, if a database's disk has high bandwidth and high
latency at the same time, it makes more sense to read bigger chunks of
data from disk in each iteration. One example of such a disk is a remote
disk (e.g. an RBD volume).

The way we read the WAL data is perfectly prefetchable by the OS, so I
wouldn't really expect gains here. Have you actually been able to see a
performance benefit by increasing MAX_SEND_SIZE?
Yes, I have seen a huge performance jump. Take a look here
</messages/by-id/CAFWczPvi_5FWH+JTqkWbi+w83hy=MYg=2hKK0=JZBe9=hTpE4w@mail.gmail.com>
for more info.
Best
Majid
On Sun, Feb 11, 2024 at 04:32:20PM +0330, Majid Garoosi wrote:
On Fri, 9 Feb 2024 at 22:33, Andres Freund <andres@anarazel.de> wrote:
The way we read the WAL data is perfectly prefetchable by the OS, so I
wouldn't really expect gains here. Have you actually been able to see a
performance benefit by increasing MAX_SEND_SIZE?

Yes, I have seen a huge performance jump. Take a look here
</messages/by-id/CAFWczPvi_5FWH+JTqkWbi+w83hy=MYg=2hKK0=JZBe9=hTpE4w@mail.gmail.com>
for more info.
Yes, I can get the idea that grouping more replication messages in
one shot can be beneficial in some cases while being
environment-dependent, though I also get the point that we cannot
simply GUC-ify everything either. I'm OK with this one at the end,
because it is not performance critical.
Note that it got lowered to the current value in ea5516081dcb to make
it more responsive, while it was half a WAL segment in 40f908bdcdc7,
when WAL senders were introduced in 2010. I cannot pinpoint the
exact thread that led to this change, but I'm adding Fujii-san and
Heikki in CC for comments.
--
Michael
Hey folks,
Any news, comments, etc. about this thread?
Best regards
Majid Garoosi
On Mon, Apr 22, 2024 at 03:40:01PM +0200, Majid Garoosi wrote:
Any news, comments, etc. about this thread?
FWIW, I'd still be in favor of doing a GUC-ification of this part, but
at this stage I'd need more time to do a proper study of a case where
this shows benefits to prove your point, or somebody else could come
in and show it.
Andres has objected to this change, on the ground that this was not
worth it, though you are telling the contrary. I would be curious to
hear from others, first, so as we gather more opinions to reach a
consensus.
--
Michael
On Tue, Apr 23, 2024 at 2:24 AM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Apr 22, 2024 at 03:40:01PM +0200, Majid Garoosi wrote:
Any news, comments, etc. about this thread?
FWIW, I'd still be in favor of doing a GUC-ification of this part, but
at this stage I'd need more time to do a proper study of a case where
this shows benefits to prove your point, or somebody else could come
in and show it.

Andres has objected to this change, on the ground that this was not
worth it, though you are telling the contrary. I would be curious to
hear from others, first, so as we gather more opinions to reach a
consensus.
I'm more with Andres on this one. I vaguely remember researching the impact
of MAX_SEND_SIZE on two independent occasions (an earlier async case and a more
recent sync case where I investigated a variety of ways to keep
latency down), and my summary would be:
First: it's very hard to get a *reliable* replication setup for
benchmarking, where one could demonstrate a correlation between e.g.
increasing MAX_SEND_SIZE and observing benefits (in sync rep it is
easier, as you are simply stalled in pgbench). Part of the problem is
the following:
a) the workload can be tricky; for this purpose it needs to be trivial but bulky
b) it needs to be on an isolated network and with guaranteed bandwidth
c) the wal_init_zero impact should be ruled out
d) the OS should be properly tuned for TCP autotuning (the max/3rd value) and
have rmem_max/wmem_max set up properly
e) a more serious TCP congestion control algorithm should be used than the
default one in the OS
f) one should prevent any I/O stalls on walreceiver writeback during
huge WAL activity and restart points on the standby (dirty_bytes and so
on, isolated pg_xlog, BDI max_ratio); illustrative sysctl values for d)-f)
are sketched right after this list
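A sketch with made-up values (assumptions, not recommendations):

  sysctl -w net.core.rmem_max=67108864
  sysctl -w net.core.wmem_max=67108864
  sysctl -w net.ipv4.tcp_rmem="4096 131072 67108864"    # 3rd value = autotuning max
  sysctl -w net.ipv4.tcp_wmem="4096 16384 67108864"
  sysctl -w net.ipv4.tcp_congestion_control=bbr         # if the module is available
  sysctl -w vm.dirty_bytes=1073741824                   # limit standby writeback stalls
  sysctl -w vm.dirty_background_bytes=268435456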
Second: once you do all of the above and ensure that there are no network or
I/O stalls, then, from what I remember, I *think* I couldn't see any impact
from playing with MAX_SEND_SIZE, probably because something else gets
saturated first.
I can offer to help test this with artificial workloads and
even inject proper (WAN-like) latency, but I think the OP (Majid) first needs
to describe his environment much better (exact latency, bandwidth,
workload, TCP sysctl values, duration of the tests, number of attempts
tried, exact commands used, iperf3 TCP results demonstrating the hardware used,
and so on). So in short the patch is easy, but demonstrating the
effect and persuading others here would be hard.
-J.
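For reference, WAN-like latency of the kind offered above can be simulated
with tc/netem; a sketch where the interface name and delay are assumptions:

  tc qdisc add dev eth0 root netem delay 15ms    # inject artificial latency
  # ... run the replication benchmark ...
  tc qdisc del dev eth0 root netem               # remove it again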
Hi,
On 2024-04-23 14:47:31 +0200, Jakub Wartak wrote:
On Tue, Apr 23, 2024 at 2:24 AM Michael Paquier <michael@paquier.xyz> wrote:
Any news, comments, etc. about this thread?
FWIW, I'd still be in favor of doing a GUC-ification of this part, but
at this stage I'd need more time to do a proper study of a case where
this shows benefits to prove your point, or somebody else could come
in and show it.

Andres has objected to this change, on the ground that this was not
worth it, though you are telling the contrary. I would be curious to
hear from others, first, so as we gather more opinions to reach a
consensus.
I think it's a bad idea to make it configurable. It's just one more guc that
nobody has a chance of realistically tuning. I'm not saying we shouldn't
improve the code - just that making MAX_SEND_SIZE configurable doesn't really
seem like a good answer.
FWIW, I have a hard time believing that MAX_SEND_SIZE is going to be the
only or even the primary issue with high latency, high bandwidth storage devices.
First: it's very hard to get a *reliable* replication setup for
benchmarking, where one could demonstrate a correlation between e.g.
increasing MAX_SEND_SIZE and observing benefits (in sync rep it is
easier, as you are simply stalled in pgbench). Part of the problem is
the following:
Depending on the workload, it's possible to measure streaming-out performance
without actually regenerating WAL. E.g. by using pg_receivewal to stream the
data out multiple times.
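A rough sketch of that approach (host, user and directory are assumptions):

  pg_receivewal -h primary.example -U replicator -D /tmp/walcopy --no-sync -v

Retaining the WAL on the primary (e.g. via wal_keep_size or a physical
replication slot) makes it possible to repeat the run over the same data.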
Another way to get fairly reproducible WAL workloads is to drive
pg_logical_emit_message() from pgbench, which tends to have a lot less
variability than running tpcb-like or such.
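A minimal sketch of that idea: put a single statement into a script file, say
emit_wal.sql (the message size is arbitrary),

  SELECT pg_logical_emit_message(false, 'bench', repeat('x', 8192));

and drive it with pgbench (client count and duration are arbitrary too):

  pgbench -n -f emit_wal.sql -c 4 -j 4 -T 60 postgres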
Second: once you do all of the above and ensure that there are no network or
I/O stalls, then, from what I remember, I *think* I couldn't see any impact
from playing with MAX_SEND_SIZE, probably because something else gets
saturated first.
My understanding of Majid's use-case for tuning MAX_SEND_SIZE is that the
bottleneck is storage, not network. The reason MAX_SEND_SIZE affects that is
that it determines the max size passed to WALRead(), which in turn determines
how much we read from the OS at once. If the storage has high latency but
also high throughput, and readahead is disabled or just not aggressive enough
after crossing segment boundaries, larger reads reduce the number of times
you're likely to be blocked waiting for read IO.
Which is also why I think that making MAX_SEND_SIZE configurable is a really
poor proxy for improving the situation.
We're imo much better off working on read_stream.[ch] support for reading WAL.
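As an aside, the block-device readahead mentioned above can be inspected and
changed like this; the device name is an assumption and the values are in
512-byte sectors:

  blockdev --getra /dev/rbd0
  blockdev --setra 4096 /dev/rbd0    # 4096 sectors = 2MB readahead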
Greetings,
Andres Freund
Hi,
My understanding of Majid's use-case for tuning MAX_SEND_SIZE is that the
bottleneck is storage, not network. The reason MAX_SEND_SIZE affects that is
that it determines the max size passed to WALRead(), which in turn determines
how much we read from the OS at once. If the storage has high latency but
also high throughput, and readahead is disabled or just not aggressive enough
after crossing segment boundaries, larger reads reduce the number of times
you're likely to be blocked waiting for read IO.

Which is also why I think that making MAX_SEND_SIZE configurable is a really
poor proxy for improving the situation.

We're imo much better off working on read_stream.[ch] support for reading WAL.
Well then that would be a consistent message at least, because earlier
in [1] it was rejected to have prefetching of the WAL segment, but on the
standby side, where the patch was only helping in configurations
having readahead *disabled* for some reason.
Now Majid stated that he uses "RBD" - Majid, any chance to specify
what that RBD really is? What's the tech? What fs? Any ioping or fio
results? What's the blockdev --report /dev/XXX output? (You stated
"high" latency and "high" bandwidth, but that is relative; for me 15ms+
is high latency and >1000MB/s sequential is high bandwidth, so it would help
others in the future if you could give the exact numbers.) Maybe
it's just a matter of enabling readahead (like in [1]) there and/or
using a larger WAL segment size during initdb.
[1]: /messages/by-id/CADVKa1WsQMBYK_02_Ji=pbOFnms+CT7TV6qJxDdHsFCiC9V_hw@mail.gmail.com
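A sketch of how the numbers asked for above could be gathered; the device,
paths and host are assumptions:

  blockdev --report /dev/rbd0
  ioping -c 10 /var/lib/postgresql/data
  fio --name=seqread --filename=/var/lib/postgresql/fio.test --rw=read --bs=1M --size=2G --direct=1
  iperf3 -c standby.example    # raw TCP throughput between the nodes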
Hi,
Now Majid stated that he uses "RBD" - Majid, any chance to specify
what that RBD really is? What's the tech? What fs? Any ioping or fio
results? What's the blockdev --report /dev/XXX output? (You stated
"high" latency and "high" bandwidth, but that is relative; for me 15ms+
is high latency and >1000MB/s sequential is high bandwidth, so it would help
others in the future if you could give the exact numbers.) Maybe
it's just a matter of enabling readahead (like in [1]) there and/or
using a larger WAL segment size during initdb.
Unfortunately, I quit that company a month ago (I wish we could
discuss this earlier) and don't have access to the environment
anymore.
I'll try to ask my teammates and see if they can find anything about
the exact values of latency, bw, etc.
Anyway, here is some description of the environment. Sadly, there
are no numbers in this description, but I'll try to provide as much detail
as possible.
There is a k8s cluster running over some VMs. Each postgres
instance runs as a pod inside the k8s cluster. So, both the
primary and standby servers are in the same DC, but might be on
different baremetal nodes. There is an overlay network for the pods to
see each other, and there's also another overlay network for the VMs
to see each other.
The storage servers are in the same DC, but we're sure they're on some
racks other than the postgres pods. They run the Ceph [1] project and provide
a Rados Block Devices (RBD) [2] interface. In order for k8s to use Ceph, a
Ceph-CSI [3] controller is running inside the k8s cluster.
BTW, the FS type is ext4.
[1]: https://ceph.io/en/
[2]: https://docs.ceph.com/en/latest/rbd/
[3]: https://github.com/ceph/ceph-csi
On Thu, Apr 25, 2024 at 02:53:33PM +0200, Majid Garoosi wrote:
Unfortunately, I quit that company a month ago (I wish we could
discuss this earlier) and don't have access to the environment
anymore.
I'll try to ask my teammates and see if they can find anything about
the exact values of latency, bw, etc.

Anyway, here is some description of the environment. Sadly, there
are no numbers in this description, but I'll try to provide as much detail
as possible.
There is a k8s cluster running over some VMs. Each postgres
instance runs as a pod inside the k8s cluster. So, both the
primary and standby servers are in the same DC, but might be on
different baremetal nodes. There is an overlay network for the pods to
see each other, and there's also another overlay network for the VMs
to see each other.
The storage servers are in the same DC, but we're sure they're on some
racks other than the postgres pods. They run Ceph [1] project and provide
Rados Block Devices (RBD) [2] interface. In order for k8s to use ceph, a
Ceph-CSI [3] controller is running inside the k8s cluster.
BTW, the FS type is ext4.
Okay, seeing the feedback for this patch and Andres disagreeing that it
is a good idea, I have marked the patch as rejected, as it was
still in the CF app.
--
Michael