Queries that should be canceled will get stuck on secure_write function
Hi, all
Recently, I got a problem that the startup process of standby is stuck and keeps in a waiting state. The backtrace of startup process shows that it is waiting for a backend process which conflicts with recovery processing to exit, the guc parameters max_standby_streaming_delay and max_standby_archive_delay are both set as 30 seconds, but it doesn't work. The backend process keeps alive, and the backtrace of this backend process show that it is waiting for the socket to be writeable in secure_write(). After further reading the code, I found that ProcessClientWriteInterrupt() in secure_write() only process interrupts when ProcDiePending is true, otherwise do nothing. However, snapshot conflicts with recovery will only set QueryCancelPending as true, so the response to the signal will de delayed indefinitely if the corresponding client is stuck, thus blocking the recovery process.
I want to know why the interrupt is only handled when ProcDiePending is true, I think query which is supposed to be canceled also should respond to the signal.
I also want to share a patch with you, I add a guc parameter max_standby_client_write_delay, if a query is supposed to be canceled, and the time delayed by a client exceeds max_standby_client_write_delay, then set ProcDiePending as true to avoid being delayed indefinitely, what do you think of it, hope to get your reply.
Thanks & Best Regard
Attachments:
0001-Set-max-client-write-delay-timeout-for-query-which-i.patchapplication/octet-streamDownload
From bbcf92ed6d17e2c6f4a8e8607a34534869d5edd6 Mon Sep 17 00:00:00 2001
From: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Date: Sun, 22 Aug 2021 19:16:21 +0800
Subject: [PATCH] Set max client write delay timeout for query which is
supposed to be canceled on a hot standby server
---
src/backend/libpq/be-secure.c | 19 +++++++++++++++++++
src/backend/utils/misc/guc.c | 11 +++++++++++
src/include/libpq/libpq.h | 1 +
3 files changed, 31 insertions(+)
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index 8ef083200a..bb5b39e527 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -36,6 +36,7 @@
#include "storage/proc.h"
#include "tcop/tcopprot.h"
#include "utils/memutils.h"
+#include "utils/timestamp.h"
char *ssl_library;
char *ssl_cert_file;
@@ -63,6 +64,9 @@ bool SSLPreferServerCiphers;
int ssl_min_protocol_version;
int ssl_max_protocol_version;
+/* GUC variable: Max client write delay for a standby query which is supposed to be canceled */
+int max_standby_client_write_delay = 30 * 1000;
+
/* ------------------------------------------------------------ */
/* Procedures common to all secure sessions */
/* ------------------------------------------------------------ */
@@ -261,6 +265,7 @@ secure_write(Port *port, void *ptr, size_t len)
{
ssize_t n;
int waitfor;
+ TimestampTz waitstart = GetCurrentTimestamp();
/* Deal with any already-pending interrupt condition. */
ProcessClientWriteInterrupt(false);
@@ -287,6 +292,9 @@ retry:
waitfor = WL_SOCKET_WRITEABLE;
}
+ if (n >= 0)
+ waitstart = GetCurrentTimestamp();
+
if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
{
WaitEvent event;
@@ -298,6 +306,17 @@ retry:
WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */ , &event, 1,
WAIT_EVENT_CLIENT_WRITE);
+ /*
+ * if current query is supposed to be canceled,
+ * and the time delayed by a client exceeds max_standby_client_write_delay, then set
+ * ProcDiePending as true to handle interrupt, avoid being delayed indefinitely by a stuck client
+ */
+ if (!ProcDiePending &&
+ QueryCancelPending &&
+ (max_standby_client_write_delay >= 0) &&
+ TimestampDifferenceExceeds(waitstart, GetCurrentTimestamp(), max_standby_client_write_delay))
+ ProcDiePending = true;
+
/* See comments in secure_read. */
if (event.events & WL_POSTMASTER_DEATH)
ereport(FATAL,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a2e0f8de7e..69f65ae73e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2257,6 +2257,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"max_standby_client_write_delay", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the maximum delay for query which is waiting for client write and is supposed to be canceled on a hot standby server."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &max_standby_client_write_delay,
+ 30 * 1000, -1, INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"recovery_min_apply_delay", PGC_SIGHUP, REPLICATION_STANDBY,
gettext_noop("Sets the minimum delay for applying changes during recovery."),
diff --git a/src/include/libpq/libpq.h b/src/include/libpq/libpq.h
index 6c51b2f20f..1776553f43 100644
--- a/src/include/libpq/libpq.h
+++ b/src/include/libpq/libpq.h
@@ -121,6 +121,7 @@ extern char *SSLECDHCurve;
extern bool SSLPreferServerCiphers;
extern int ssl_min_protocol_version;
extern int ssl_max_protocol_version;
+extern int max_standby_client_write_delay;
enum ssl_protocol_versions
{
--
2.24.3 (Apple Git-128)