Redux: Throttle WAL inserts before commit

Started by Shirisha Shirisha, over 1 year ago, 2 messages
#1 Shirisha Shirisha
shirisha.sn@broadcom.com
1 attachment(s)

Hello hackers,

This is an attempt to resurrect the thread [1] to throttle WAL inserts
before the point of commit.

Background:

When synchronous_commit is on, transactions wait at commit for
replication, making sure WAL is flushed up to the commit LSN on the
standby.

While commit is a mandatory sync/wait point, also waiting for
replication at periodic intervals along the way can be desirable and
makes the transaction a better citizen. Consider, for example, a setup
where the primary and the standby can each write at 20GB/sec, while the
network between them can only transfer 2GB/sec. If a CTAS for a large
table is run in such a setup, it generates WAL on the primary far more
aggressively than it can be shipped to the standby, so pending WAL
builds up on the primary (the example query after the list below shows
one way to observe this). This causes two main problems:

- Fairness: new write transactions (even a single-tuple INSERT, UPDATE
or DELETE), and even read transactions (setting hint bits) see latency
equivalent to the time it takes to ship and flush the pending WAL to
the standby.

- Disk space: the primary needs enough space to hold all of that WAL,
since WAL cannot be recycled until it has been shipped to the standby
when replication slots are in use.
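
For context, one way to observe this kind of build-up on the primary is
to compare the primary's current WAL position with what the standby has
flushed, using the standard pg_stat_replication view, e.g.:

    -- approximate per-standby replication lag, in bytes
    SELECT application_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes
    FROM pg_stat_replication;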

Proposed solution (patch attached):

- A global (backend-local) variable wal_bytes_written tracks the amount
of WAL written by the backend since the start of the transaction or the
last time SyncRepWaitForLSN() was called for this transaction.

- Whenever wal_bytes_written exceeds the new
wait_for_replication_threshold GUC (see the usage sketch after this
list), we set the control flag XLogThrottlePending (similar in spirit
to LogMemoryContextPending), which is then handled at
ProcessInterrupts() time. This is the mechanism proposed in [2]. Doing
it this way avoids problems such as waiting while holding locks or
inside a critical section.

- To do the wait itself, we rely on SyncRepWaitForLSN(), passing the
cached value of the WAL flush point (LogwrtResult.Flush).
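
For illustration, with the patch applied and a synchronous standby
configured, the throttling could be enabled roughly as follows (a
hypothetical usage sketch; the GUC is in KB, is PGC_SIGHUP so a reload
suffices, and 0 keeps it disabled):

    -- ask backends to wait for the synchronous standby roughly
    -- every 1MB of WAL written by the transaction
    ALTER SYSTEM SET wait_for_replication_threshold = '1MB';
    SELECT pg_reload_conf();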

[1]: /messages/by-id/CAHg+QDcO_zhgBCMn5SosvhuuCoJ1vKmLjnVuqUEOd4S73B1urw@mail.gmail.com
[2]: /messages/by-id/20220105174643.lozdd3radxv4tlmx@alap3.anarazel.de

Regards,
Shirisha
Broadcom Inc.


Attachments:

v1-0001-WAL-throttling-mechanism-for-synchronous-replicat.patch (application/octet-stream)
From e57400fe673b4f64eb7fc9e9ac75c344a2e70fc0 Mon Sep 17 00:00:00 2001
From: Shirisha SN <sshirisha@vmware.com>
Date: Tue, 27 Aug 2024 12:45:08 +0530
Subject: [PATCH v1 1/1] WAL throttling mechanism for synchronous replication
 based on amount of WAL written
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Background:

When synchronous_commit is on, transactions wait at commit for
replication, making sure WAL is flushed up to the commit LSN on the
standby.

While commit is a mandatory sync/wait point, also waiting for
replication at periodic intervals along the way can be desirable and
makes the transaction a better citizen. Consider, for example, a setup
where the primary and the standby can each write at 20GB/sec, while the
network between them can only transfer 2GB/sec. If a CTAS for a large
table is run in such a setup, it generates WAL on the primary far more
aggressively than it can be shipped to the standby, so pending WAL
builds up on the primary. This causes two main problems:

- Fairness: new write transactions (even a single-tuple INSERT, UPDATE
  or DELETE), and even read transactions (setting hint bits) see latency
  equivalent to the time it takes to ship and flush the pending WAL to
  the standby.

- Disk space: the primary needs enough space to hold all of that WAL,
  since WAL cannot be recycled until it has been shipped to the standby
  when replication slots are in use.

Proposed solution (patch attached):

- A global (backend-local) variable wal_bytes_written tracks the amount
  of WAL written by the backend since the start of the transaction or
  the last time SyncRepWaitForLSN() was called for this transaction.

- Whenever wal_bytes_written exceeds the new
  wait_for_replication_threshold GUC, we set the control flag
  XLogThrottlePending (similar in spirit to LogMemoryContextPending),
  which is then handled at ProcessInterrupts() time. This is the
  mechanism proposed in [1]. Doing it this way avoids problems such as
  waiting while holding locks or inside a critical section.

- To do the wait itself, we rely on SyncRepWaitForLSN(), passing the
  cached value of the WAL flush point (LogwrtResult.Flush).

Co-authored-by: Ashwin Agrawal <aashwin@vmware.com>
Co-authored-by: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>

[1] https://www.postgresql.org/message-id/20220105174643.lozdd3radxv4tlmx%40alap3.anarazel.de

---
 src/backend/access/transam/xact.c       |  3 +
 src/backend/access/transam/xlog.c       | 33 ++++++++++
 src/backend/access/transam/xloginsert.c |  8 +++
 src/backend/tcop/postgres.c             |  7 +++
 src/backend/utils/init/globals.c        |  1 +
 src/backend/utils/misc/guc_tables.c     | 12 ++++
 src/include/access/xlog.h               |  4 ++
 src/include/miscadmin.h                 |  1 +
 src/test/recovery/t/044_wal_throttle.pl | 82 +++++++++++++++++++++++++
 9 files changed, 151 insertions(+)
 create mode 100644 src/test/recovery/t/044_wal_throttle.pl

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index dfc8cf2dcf..17e4c6910e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2204,6 +2204,9 @@ StartTransaction(void)
 	if (TransactionTimeout > 0)
 		enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
 
+	/* Initialize wal_bytes_written */
+	wal_bytes_written = 0;
+
 	ShowTransactionState("StartTransaction");
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ee0fb0e28f..679bd8ff3a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -82,6 +82,7 @@
 #include "replication/origin.h"
 #include "replication/slot.h"
 #include "replication/snapbuild.h"
+#include "replication/syncrep.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/bufmgr.h"
@@ -134,6 +135,7 @@ int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
 int			wal_decode_buffer_size = 512 * 1024;
 bool		track_wal_io_timing = false;
+int         rep_lag_avoidance_threshold = 0;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -164,6 +166,15 @@ static double PrevCheckPointDistance = 0;
  */
 static bool check_wal_consistency_checking_deferred = false;
 
+/*
+ * Tracks how much WAL has been written by this backend since the start of the
+ * transaction, or since the last time SyncRepWaitForLSN() was called for this
+ * transaction. Currently, this is used to check whether the replication lag
+ * avoidance threshold has been reached and it is time to wait for replication
+ * before this transaction moves forward.
+ */
+uint64_t wal_bytes_written = 0;
+
 /*
  * GUC support
  */
@@ -921,6 +932,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		CopyXLogRecordToWAL(rechdr->xl_tot_len,
 							class == WALINSERT_SPECIAL_SWITCH, rdata,
 							StartPos, EndPos, insertTLI);
+		wal_bytes_written += rechdr->xl_tot_len;
 
 		/*
 		 * Unless record is flagged as not important, update LSN of last
@@ -9480,3 +9492,24 @@ SetWalWriterSleeping(bool sleeping)
 	XLogCtl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&XLogCtl->info_lck);
 }
+
+/*
+ * This function checks whether the current transaction has written more
+ * than rep_lag_avoidance_threshold kilobytes of WAL, and if so enters a
+ * syncrep wait for the standby.
+ */
+void
+wait_to_avoid_large_repl_lag(void)
+{
+	Assert(rep_lag_avoidance_threshold);
+
+	if (wal_bytes_written > (rep_lag_avoidance_threshold * 1024))
+	{
+		HOLD_INTERRUPTS();
+		/* we use local cached copy of LogwrtResult here */
+		SyncRepWaitForLSN(LogwrtResult.Flush, false);
+		RESUME_INTERRUPTS();
+
+		wal_bytes_written = 0;
+	}
+}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 9047601534..f9e51420d5 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -526,6 +526,14 @@ XLogInsert(RmgrId rmid, uint8 info)
 
 	XLogResetInsertion();
 
+	/* rep_lag_avoidance_threshold is defined in KB */
+	if (rep_lag_avoidance_threshold &&
+		wal_bytes_written > (rep_lag_avoidance_threshold * 1024))
+	{
+		InterruptPending = true;
+		XLogThrottlePending = true;
+	}
+
 	return EndPos;
 }
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8bc6bea113..fa87f861ca 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3261,6 +3261,7 @@ ProcessInterrupts(void)
 	{
 		ProcDiePending = false;
 		QueryCancelPending = false; /* ProcDie trumps QueryCancel */
+		XLogThrottlePending = false;
 		LockErrorCleanup();
 		/* As in quickdie, don't risk sending to client during auth */
 		if (ClientAuthInProgress && whereToSendOutput == DestRemote)
@@ -3479,6 +3480,12 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		HandleParallelApplyMessages();
+
+	if (XLogThrottlePending)
+	{
+		XLogThrottlePending = false;
+		wait_to_avoid_large_repl_lag();
+	}
 }
 
 /*
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 03a54451ac..462f302324 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -39,6 +39,7 @@ volatile sig_atomic_t IdleSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
 volatile sig_atomic_t LogMemoryContextPending = false;
 volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
+volatile sig_atomic_t XLogThrottlePending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index af227b1f24..e14d52b6a8 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2902,6 +2902,18 @@ struct config_int ConfigureNamesInt[] =
 		NULL, assign_max_wal_size, NULL
 	},
 
+	{
+		{"wait_for_replication_threshold", PGC_SIGHUP, REPLICATION_PRIMARY,
+			gettext_noop("Maximum amount of WAL written by a transaction prior to waiting for replication."),
+			gettext_noop("This is used to prevent the primary from racing too far ahead "
+							"of the standby. A value of 0 disables the behavior."),
+			GUC_UNIT_KB
+		},
+		&rep_lag_avoidance_threshold,
+		0, 0, MAX_KILOBYTES,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"checkpoint_timeout", PGC_SIGHUP, WAL_CHECKPOINTS,
 			gettext_noop("Sets the maximum time between automatic WAL checkpoints."),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 083810f5b4..d9461d4f65 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -56,9 +56,12 @@ extern PGDLLIMPORT int CommitDelay;
 extern PGDLLIMPORT int CommitSiblings;
 extern PGDLLIMPORT bool track_wal_io_timing;
 extern PGDLLIMPORT int wal_decode_buffer_size;
+extern PGDLLIMPORT int rep_lag_avoidance_threshold;
 
 extern PGDLLIMPORT int CheckPointSegments;
 
+extern uint64_t wal_bytes_written;
+
 /* Archive modes */
 typedef enum ArchiveMode
 {
@@ -268,6 +271,7 @@ extern void ReachedEndOfBackup(XLogRecPtr EndRecPtr, TimeLineID tli);
 extern void SetInstallXLogFileSegmentActive(void);
 extern bool IsInstallXLogFileSegmentActive(void);
 extern void XLogShutdownWalRcv(void);
+extern void wait_to_avoid_large_repl_lag(void);
 
 /*
  * Routines to start, stop, and get status of a base backup.
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 25348e71eb..613c18edb8 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -96,6 +96,7 @@ extern PGDLLIMPORT volatile sig_atomic_t IdleSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
 extern PGDLLIMPORT volatile sig_atomic_t LogMemoryContextPending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t XLogThrottlePending;
 
 extern PGDLLIMPORT volatile sig_atomic_t CheckClientConnectionPending;
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/test/recovery/t/044_wal_throttle.pl b/src/test/recovery/t/044_wal_throttle.pl
new file mode 100644
index 0000000000..51053d22a0
--- /dev/null
+++ b/src/test/recovery/t/044_wal_throttle.pl
@@ -0,0 +1,82 @@
+
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Test wait_for_replication_threshold
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Utils;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+# This test depends on Perl's `kill`, which apparently is not
+# portable to Windows.  (It would be nice to use Test::More's `subtest`,
+# but that's not in the ancient version we require.)
+if ($PostgreSQL::Test::Utils::windows_os)
+{
+    done_testing();
+    exit;
+}
+
+# Initialize primary node, setting the replication threshold to 1KB
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf(
+    'postgresql.conf', qq(
+synchronous_commit = on
+synchronous_standby_names = '*'
+wait_for_replication_threshold = 1
+));
+$node_primary->start;
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+    has_streaming => 1);
+$node_standby->start;
+
+# Wait for the standby to catch up
+$node_primary->wait_for_catchup($node_standby);
+
+$node_primary->safe_psql('postgres',
+    "CREATE TABLE throttle_test(i int)"
+);
+
+# Pause the walreceiver process and insert data from a background psql session.
+# This creates WAL lag on the standby, leading to a throttled backend
+# 'insert' session that waits for the standby to catch up.
+my $receiverpid = $node_standby->safe_psql('postgres',
+    "SELECT pid FROM pg_stat_activity WHERE backend_type = 'walreceiver'");
+like($receiverpid, qr/^[0-9]+$/, "have walreceiver pid $receiverpid");
+
+kill 'STOP', $receiverpid;
+
+my $throttle_h = $node_primary->background_psql('postgres');
+
+$throttle_h->query_until(
+    qr/start/, q(
+\echo start
+BEGIN;
+INSERT INTO throttle_test SELECT 1 FROM generate_series(1, 1000000);
+));
+
+# Check for SyncRep wait event
+$node_primary->poll_query_until('postgres',
+    "SELECT EXISTS(SELECT * FROM pg_stat_activity WHERE wait_event = 'SyncRep')")
+    or die "timed out trying to find throttled backend";
+
+# Resume the walreceiver process and query
+# until the wait event is gone
+kill 'CONT', $receiverpid;
+
+$node_primary->poll_query_until('postgres',
+    "SELECT NOT EXISTS(SELECT * FROM pg_stat_activity WHERE wait_event = 'SyncRep')")
+    or die "timed out waiting for throttled backend to get unblocked";
+
+$throttle_h->quit;
+
+done_testing();
-- 
2.39.3 (Apple Git-146)

#2 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Shirisha Shirisha (#1)
Re: Redux: Throttle WAL inserts before commit

On Tue, Aug 27, 2024 at 12:51 PM Shirisha Shirisha
<shirisha.sn@broadcom.com> wrote:

> Hello hackers,
>
> This is an attempt to resurrect the thread [1] to throttle WAL inserts
> before the point of commit.
>
> Background:
>
> When synchronous_commit is on, transactions wait at commit for
> replication, making sure WAL is flushed up to the commit LSN on the
> standby.

Hi Shirisha,

Just to let you know, there was a more recent attempt at this in [1] in
Jan 2023, with a resurrection attempt there in Nov 2023 by Tomas. Those
patches received plenty of attention and were also based on
SyncRepWaitForLSN(), but somehow we ran out of steam and there was not
that much interest back then.

Maybe you could post a review there (of Tomas's more recent patch) if
it still helps your use case today. That way it could get some traction
again?

-Jakub Wartak.