Improve the connection failure error messages
Hi Hackers,
Various sections of the code utilize the walrcv_connect() function,
employed by various processes such as walreceiver, logical replication
apply worker, etc., to establish connections with other hosts.
Presently, in case of connection failures, the error message lacks
information about the specific process attempting to connect and
encountering the failure.
The provided patch enhances error messages for such connection
failures, offering more details on the processes that failed to
establish a connection.
--
Thanks,
Nisha
Attachments:
v1-0001-Improve-the-connection-failure-error-messages.patchapplication/octet-stream; name=v1-0001-Improve-the-connection-failure-error-messages.patchDownload
From 556c249973cc01acde91ab1e8b31df71713bfb5a Mon Sep 17 00:00:00 2001
From: Nisha Moond <nisha.moond412@gmail.com>
Date: Wed, 10 Jan 2024 12:04:18 +0530
Subject: [PATCH v3] Improve the connection failure error messages
Enhances the error messages for walrcv_connect() connection failures,
offering more details on the processes that failed to establish a connection
---
src/backend/commands/subscriptioncmds.c | 4 ++--
src/backend/replication/logical/tablesync.c | 3 ++-
src/backend/replication/logical/worker.c | 3 ++-
src/backend/replication/walreceiver.c | 9 +++++----
src/test/regress/expected/subscription.out | 2 +-
5 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 75e6cd8ae3..52312adaeb 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -742,7 +742,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("\"%s\" could not connect to the publisher: %s", stmt->subname, err)));
PG_TRY();
{
@@ -875,7 +875,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("\"%s\" could not connect to the publisher: %s", sub->name, err)));
PG_TRY();
{
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 06d5b3df33..1e4b878bf4 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1335,7 +1335,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("logical replication table synchronization worker for subscription \"%s\" could not connect to the publisher: %s",
+ MySubscription->name, err)));
Assert(MyLogicalRepWorker->relstate == SUBREL_STATE_INIT ||
MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC ||
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 911835c5cb..5553a6becd 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4518,7 +4518,8 @@ run_apply_worker()
if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("logical replication apply worker for subscription \"%s\" could not connect to the publisher: %s",
+ MySubscription->name, err)));
/*
* We don't really use the output identify_system for anything but it does
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e00395ff2b..9bdd501ea4 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -198,6 +198,7 @@ WalReceiverMain(void)
char *err;
char *sender_host = NULL;
int sender_port = 0;
+ char *appname = NULL;
/*
* WalRcv should be set up already (if we are a backend, we inherit this
@@ -295,14 +296,14 @@ WalReceiverMain(void)
/* Unblock signals (they were blocked when the postmaster forked us) */
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+ appname = cluster_name[0] ? cluster_name : "walreceiver";
+
/* Establish the connection to the primary for XLOG streaming */
- wrconn = walrcv_connect(conninfo, false, false,
- cluster_name[0] ? cluster_name : "walreceiver",
- &err);
+ wrconn = walrcv_connect(conninfo, false, false, appname, &err);
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err)));
+ errmsg("%s could not connect to the primary server: %s", appname, err)));
/*
* Save user-visible connection string. This clobbers the original
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index b15eddbff3..b55ce326ea 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -139,7 +139,7 @@ ERROR: invalid connection string syntax: invalid connection option "i_dont_exis
-- fail, connection string parses, but doesn't work (and does so without
-- connecting, so this is reliable and safe)
CREATE SUBSCRIPTION regress_testsub5 CONNECTION 'port=-1' PUBLICATION testpub;
-ERROR: could not connect to the publisher: invalid port number: "-1"
+ERROR: "regress_testsub5" could not connect to the publisher: invalid port number: "-1"
-- fail - invalid connection string during ALTER
ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
ERROR: invalid connection string syntax: missing "=" after "foobar" in connection info string
--
2.34.1
Thanks for the patch! Here are a couple of review comments for it.
======
src/backend/commands/subscriptioncmds.c
1.
@@ -742,7 +742,7 @@ CreateSubscription(ParseState *pstate,
CreateSubscriptionStmt *stmt,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("\"%s\" could not connect to the publisher: %s", stmt->subname, err)));
In practice, these commands give errors like:
test_sub=# create subscription sub1 connection 'dbname=bogus' publication pub1;
ERROR: could not connect to the publisher: connection to server on
socket "/tmp/.s.PGSQL.5432" failed: FATAL: database "bogus" does not
exist
and logs like:
2024-01-12 12:45:05.177 AEDT [13108] ERROR: could not connect to the
publisher: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
FATAL: database "bogus" does not exist
2024-01-12 12:45:05.177 AEDT [13108] STATEMENT: create subscription
sub1 connection 'dbname=bogus' publication pub1;
Since the subscription name is already obvious from the statement that
caused the error I'm not sure it benefits much to add this to the
error message (but maybe it is useful if the error message was somehow
read in isolation from the statement).
Anyway, I felt at least it should include the word "subscription" for
better consistency with the other messages in this patch:
SUGGESTION
subscription \"%s\" could not connect to the publisher: %s
======
2.
+ appname = cluster_name[0] ? cluster_name : "walreceiver";
+
/* Establish the connection to the primary for XLOG streaming */
- wrconn = walrcv_connect(conninfo, false, false,
- cluster_name[0] ? cluster_name : "walreceiver",
- &err);
+ wrconn = walrcv_connect(conninfo, false, false, appname, &err);
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err)));
+ errmsg("%s could not connect to the primary server: %s", appname, err)));
I think your new %s should be quoted according to the guidelines at [1]https://www.postgresql.org/docs/current/error-style-guide.html.
======
src/test/regress/expected/subscription.out
3.
Apparently, there is no existing regression test case for the ALTER
"could not connect" message because if there was, it would have
failed. Maybe a test should be added?
======
[1]: https://www.postgresql.org/docs/current/error-style-guide.html
Kind Regards,
Peter Smith.
Fujitsu Australia
Thanks for the review. Attached v2 patch with suggested changes.
Please find my response inline.
On Fri, Jan 12, 2024 at 8:20 AM Peter Smith <smithpb2250@gmail.com> wrote:
Thanks for the patch! Here are a couple of review comments for it.
======
src/backend/commands/subscriptioncmds.c1. @@ -742,7 +742,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt, if (!wrconn) ereport(ERROR, (errcode(ERRCODE_CONNECTION_FAILURE), - errmsg("could not connect to the publisher: %s", err))); + errmsg("\"%s\" could not connect to the publisher: %s", stmt->subname, err)));In practice, these commands give errors like:
test_sub=# create subscription sub1 connection 'dbname=bogus' publication pub1;
ERROR: could not connect to the publisher: connection to server on
socket "/tmp/.s.PGSQL.5432" failed: FATAL: database "bogus" does not
existand logs like:
2024-01-12 12:45:05.177 AEDT [13108] ERROR: could not connect to the
publisher: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
FATAL: database "bogus" does not exist
2024-01-12 12:45:05.177 AEDT [13108] STATEMENT: create subscription
sub1 connection 'dbname=bogus' publication pub1;Since the subscription name is already obvious from the statement that
caused the error I'm not sure it benefits much to add this to the
error message (but maybe it is useful if the error message was somehow
read in isolation from the statement).Anyway, I felt at least it should include the word "subscription" for
better consistency with the other messages in this patch:SUGGESTION
subscription \"%s\" could not connect to the publisher: %s
Done.
======
2. + appname = cluster_name[0] ? cluster_name : "walreceiver"; + /* Establish the connection to the primary for XLOG streaming */ - wrconn = walrcv_connect(conninfo, false, false, - cluster_name[0] ? cluster_name : "walreceiver", - &err); + wrconn = walrcv_connect(conninfo, false, false, appname, &err); if (!wrconn) ereport(ERROR, (errcode(ERRCODE_CONNECTION_FAILURE), - errmsg("could not connect to the primary server: %s", err))); + errmsg("%s could not connect to the primary server: %s", appname, err)));I think your new %s should be quoted according to the guidelines at [1].
Done.
======
src/test/regress/expected/subscription.out3.
Apparently, there is no existing regression test case for the ALTER
"could not connect" message because if there was, it would have
failed. Maybe a test should be added?
The ALTER SUBSCRIPTION command does not error out on the user
interface if updated with a bad connection string and the connection
failure error can only be seen in the respective log file.
Due to this behavior, it is not possible to add a test to show the
error message as it is done for CREATE SUBSCRIPTION.
Let me know if you think there is another way to add this test.
--
Thanks,
Nisha
Attachments:
v2-0001-Improve-the-connection-failure-error-messages.patchapplication/octet-stream; name=v2-0001-Improve-the-connection-failure-error-messages.patchDownload
From ba3e38495b038bdfc176e25d3e1b721f09d0053b Mon Sep 17 00:00:00 2001
From: Nisha Moond <nisha.moond412@gmail.com>
Date: Wed, 10 Jan 2024 12:04:18 +0530
Subject: [PATCH v2] Improve the connection failure error messages
Enhances the error messages for walrcv_connect() connection failures,
offering more details on the processes that failed to establish a connection
---
src/backend/commands/subscriptioncmds.c | 4 ++--
src/backend/replication/logical/tablesync.c | 3 ++-
src/backend/replication/logical/worker.c | 3 ++-
src/backend/replication/walreceiver.c | 9 +++++----
src/test/regress/expected/subscription.out | 2 +-
5 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 75e6cd8ae3..3477769a74 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -742,7 +742,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("subscription \"%s\" could not connect to the publisher: %s", stmt->subname, err)));
PG_TRY();
{
@@ -875,7 +875,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("subscription \"%s\" could not connect to the publisher: %s", sub->name, err)));
PG_TRY();
{
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 06d5b3df33..1e4b878bf4 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1335,7 +1335,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("logical replication table synchronization worker for subscription \"%s\" could not connect to the publisher: %s",
+ MySubscription->name, err)));
Assert(MyLogicalRepWorker->relstate == SUBREL_STATE_INIT ||
MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC ||
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 911835c5cb..5553a6becd 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4518,7 +4518,8 @@ run_apply_worker()
if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("logical replication apply worker for subscription \"%s\" could not connect to the publisher: %s",
+ MySubscription->name, err)));
/*
* We don't really use the output identify_system for anything but it does
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e00395ff2b..9d744dd5d5 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -198,6 +198,7 @@ WalReceiverMain(void)
char *err;
char *sender_host = NULL;
int sender_port = 0;
+ char *appname = NULL;
/*
* WalRcv should be set up already (if we are a backend, we inherit this
@@ -295,14 +296,14 @@ WalReceiverMain(void)
/* Unblock signals (they were blocked when the postmaster forked us) */
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+ appname = cluster_name[0] ? cluster_name : "walreceiver";
+
/* Establish the connection to the primary for XLOG streaming */
- wrconn = walrcv_connect(conninfo, false, false,
- cluster_name[0] ? cluster_name : "walreceiver",
- &err);
+ wrconn = walrcv_connect(conninfo, false, false, appname, &err);
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err)));
+ errmsg("\"%s\" could not connect to the primary server: %s", appname, err)));
/*
* Save user-visible connection string. This clobbers the original
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index b15eddbff3..3eb9e287d0 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -139,7 +139,7 @@ ERROR: invalid connection string syntax: invalid connection option "i_dont_exis
-- fail, connection string parses, but doesn't work (and does so without
-- connecting, so this is reliable and safe)
CREATE SUBSCRIPTION regress_testsub5 CONNECTION 'port=-1' PUBLICATION testpub;
-ERROR: could not connect to the publisher: invalid port number: "-1"
+ERROR: subscription "regress_testsub5" could not connect to the publisher: invalid port number: "-1"
-- fail - invalid connection string during ALTER
ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
ERROR: invalid connection string syntax: missing "=" after "foobar" in connection info string
--
2.34.1
Hi,
Thanks for the patch.
Due to this behavior, it is not possible to add a test to show the
error message as it is done for CREATE SUBSCRIPTION.
Let me know if you think there is another way to add this test.
I believe it can be done with TAP tests, see for instance:
contrib/auto_explain/t/001_auto_explain.pl
However I wouldn't insist on including the test in scope of this
particular patch. Such a test doesn't currently exist, it can be added
as a separate patch, and whether this is actually a useful test (all
the tests consume time after all...) is somewhat debatable. Personally
I agree that it would be nice to have though.
This being said...
The ALTER SUBSCRIPTION command does not error out on the user
interface if updated with a bad connection string and the connection
failure error can only be seen in the respective log file.
I wonder if we should fix this. Again, not necessarily in scope of
this work, but if this is not a difficult task, again, it would be
nice to have.
Other than that v2 looks good.
--
Best regards,
Aleksander Alekseev
On Sat, Jan 13, 2024 at 12:36 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:
Hi,
Thanks for the patch.
Due to this behavior, it is not possible to add a test to show the
error message as it is done for CREATE SUBSCRIPTION.
Let me know if you think there is another way to add this test.I believe it can be done with TAP tests, see for instance:
contrib/auto_explain/t/001_auto_explain.pl
However I wouldn't insist on including the test in scope of this
particular patch. Such a test doesn't currently exist, it can be added
as a separate patch, and whether this is actually a useful test (all
the tests consume time after all...) is somewhat debatable. Personally
I agree that it would be nice to have though.This being said...
The ALTER SUBSCRIPTION command does not error out on the user
interface if updated with a bad connection string and the connection
failure error can only be seen in the respective log file.I wonder if we should fix this. Again, not necessarily in scope of
this work, but if this is not a difficult task, again, it would be
nice to have.Other than that v2 looks good.
OK. I see now that any ALTER of the subscription's connection, even
to some value that fails, will restart a new worker (like ALTER of any
other subscription parameters). For a bad connection, it will continue
to relaunch-worker/ERROR over and over.
test_sub=# \r2024-01-17 09:34:28.665 AEDT [11274] LOG: logical
replication apply worker for subscription "sub4" has started
2024-01-17 09:34:28.666 AEDT [11274] ERROR: could not connect to the
publisher: invalid port number: "-1"
2024-01-17 09:34:28.667 AEDT [928] LOG: background worker "logical
replication apply worker" (PID 11274) exited with exit code 1
dRs su2024-01-17 09:34:33.669 AEDT [11391] LOG: logical replication
apply worker for subscription "sub4" has started
2024-01-17 09:34:33.669 AEDT [11391] ERROR: could not connect to the
publisher: invalid port number: "-1"
2024-01-17 09:34:33.670 AEDT [928] LOG: background worker "logical
replication apply worker" (PID 11391) exited with exit code 1
b4
...
I don't really have any opinion if that behaviour is good or bad, but
anyway, it is deliberate, and IMO it is outside the scope of your
patch, so v2 patch LGTM.
~~
BTW, while experimenting with the bad connection ALTER I also tried
setting 'disable_on_error' like below:
ALTER SUBSCRIPTION sub4 SET (disable_on_error);
ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';
...but here the subscription did not become DISABLED as I expected it
would do on the next connection error iteration. It remains enabled
and just continues to loop relaunch/ERROR indefinitely same as before.
That looks like it may be a bug. Thoughts?
======
Kind Regards,
Peter Smith.
Fujitsu Australia
Thanks for reviewing, please find my response inline.
On Wed, Jan 17, 2024 at 4:56 AM Peter Smith <smithpb2250@gmail.com> wrote:
On Sat, Jan 13, 2024 at 12:36 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:Hi,
Thanks for the patch.
Due to this behavior, it is not possible to add a test to show the
error message as it is done for CREATE SUBSCRIPTION.
Let me know if you think there is another way to add this test.I believe it can be done with TAP tests, see for instance:
contrib/auto_explain/t/001_auto_explain.pl
However I wouldn't insist on including the test in scope of this
particular patch. Such a test doesn't currently exist, it can be added
as a separate patch, and whether this is actually a useful test (all
the tests consume time after all...) is somewhat debatable. Personally
I agree that it would be nice to have though.This being said...
The ALTER SUBSCRIPTION command does not error out on the user
interface if updated with a bad connection string and the connection
failure error can only be seen in the respective log file.I wonder if we should fix this. Again, not necessarily in scope of
this work, but if this is not a difficult task, again, it would be
nice to have.Other than that v2 looks good.
OK. I see now that any ALTER of the subscription's connection, even
to some value that fails, will restart a new worker (like ALTER of any
other subscription parameters). For a bad connection, it will continue
to relaunch-worker/ERROR over and over.test_sub=# \r2024-01-17 09:34:28.665 AEDT [11274] LOG: logical
replication apply worker for subscription "sub4" has started
2024-01-17 09:34:28.666 AEDT [11274] ERROR: could not connect to the
publisher: invalid port number: "-1"
2024-01-17 09:34:28.667 AEDT [928] LOG: background worker "logical
replication apply worker" (PID 11274) exited with exit code 1
dRs su2024-01-17 09:34:33.669 AEDT [11391] LOG: logical replication
apply worker for subscription "sub4" has started
2024-01-17 09:34:33.669 AEDT [11391] ERROR: could not connect to the
publisher: invalid port number: "-1"
2024-01-17 09:34:33.670 AEDT [928] LOG: background worker "logical
replication apply worker" (PID 11391) exited with exit code 1
b4
...I don't really have any opinion if that behaviour is good or bad, but
anyway, it is deliberate, and IMO it is outside the scope of your
patch, so v2 patch LGTM.
Upon code review, the ALTER SUBSCRIPTION updates the connection string
after checking for parse and a few obvious errors and does not attempt
to establish a connection. It is the apply worker running for the
respective subscription that will try to connect and fail in case of a
bad connection string.
To me, it seems an intentional design behavior and I agree that
deciding to change or maintain this behavior is out of this patch's
scope.
--
Thanks,
Nisha
~~
BTW, while experimenting with the bad connection ALTER I also tried
setting 'disable_on_error' like below:ALTER SUBSCRIPTION sub4 SET (disable_on_error);
ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';...but here the subscription did not become DISABLED as I expected it
would do on the next connection error iteration. It remains enabled
and just continues to loop relaunch/ERROR indefinitely same as before.That looks like it may be a bug. Thoughts?
Ideally, if the already running apply worker in
"LogicalRepApplyLoop()" has any exception/error it will be handled and
the subscription will be disabled if 'disable_on_error' is set -
start_apply(XLogRecPtr origin_startpos)
{
PG_TRY();
{
LogicalRepApplyLoop(origin_startpos);
}
PG_CATCH();
{
if (MySubscription->disableonerr)
DisableSubscriptionAndExit();
...
What is happening in this case is that the control reaches the function -
run_apply_worker() -> start_apply() -> LogicalRepApplyLoop ->
maybe_reread_subscription()
...
/*
* Exit if any parameter that affects the remote connection was changed.
* The launcher will start a new worker but note that the parallel apply
* worker won't restart if the streaming option's value is changed from
* 'parallel' to any other value or the server decides not to stream the
* in-progress transaction.
*/
if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
...
and it sees a change in the parameter and calls apply_worker_exit().
This will exit the current process, without throwing an exception to
the caller and the postmaster will try to restart the apply worker.
The new apply worker, before reaching the start_apply() [where we
handle exception], will hit the code to establish the connection to
the publisher -
ApplyWorkerMain() -> run_apply_worker() -
...
LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo,
true /* replication */ ,
true,
must_use_password,
MySubscription->name, &err);
if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
errmsg("could not connect to the publisher: %s", err)));
...
and due to the bad connection string in the subscription, it will error out.
[28680]: ERROR: could not connect to the publisher: invalid port number: "-1"
[3196]: LOG: background worker "logical replication apply worker" (PID 28680) exited with exit code 1
28680) exited with exit code 1
Now, the postmaster keeps trying to restart the apply worker and it
will keep failing until the connection string is corrected or the
subscription is disabled manually.
I think this is a bug that needs to be handled in run_apply_worker()
when disable_on_error is set.
IMO, this bug-fix discussion deserves a separate thread. Thoughts?
On Wed, Jan 17, 2024 at 7:15 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
~~
BTW, while experimenting with the bad connection ALTER I also tried
setting 'disable_on_error' like below:ALTER SUBSCRIPTION sub4 SET (disable_on_error);
ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';...but here the subscription did not become DISABLED as I expected it
would do on the next connection error iteration. It remains enabled
and just continues to loop relaunch/ERROR indefinitely same as before.That looks like it may be a bug. Thoughts?
Ideally, if the already running apply worker in
"LogicalRepApplyLoop()" has any exception/error it will be handled and
the subscription will be disabled if 'disable_on_error' is set -start_apply(XLogRecPtr origin_startpos)
{
PG_TRY();
{
LogicalRepApplyLoop(origin_startpos);
}
PG_CATCH();
{
if (MySubscription->disableonerr)
DisableSubscriptionAndExit();
...What is happening in this case is that the control reaches the function -
run_apply_worker() -> start_apply() -> LogicalRepApplyLoop ->
maybe_reread_subscription()
...
/*
* Exit if any parameter that affects the remote connection was changed.
* The launcher will start a new worker but note that the parallel apply
* worker won't restart if the streaming option's value is changed from
* 'parallel' to any other value or the server decides not to stream the
* in-progress transaction.
*/
if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
...and it sees a change in the parameter and calls apply_worker_exit().
This will exit the current process, without throwing an exception to
the caller and the postmaster will try to restart the apply worker.
The new apply worker, before reaching the start_apply() [where we
handle exception], will hit the code to establish the connection to
the publisher -ApplyWorkerMain() -> run_apply_worker() -
...
LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo,
true /* replication */ ,
true,
must_use_password,
MySubscription->name, &err);if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
errmsg("could not connect to the publisher: %s", err)));
...
and due to the bad connection string in the subscription, it will error out.
[28680] ERROR: could not connect to the publisher: invalid port number: "-1"
[3196] LOG: background worker "logical replication apply worker" (PID
28680) exited with exit code 1Now, the postmaster keeps trying to restart the apply worker and it
will keep failing until the connection string is corrected or the
subscription is disabled manually.I think this is a bug that needs to be handled in run_apply_worker()
when disable_on_error is set.
IMO, this bug-fix discussion deserves a separate thread. Thoughts?
Hi Nisha,
Thanks for your analysis -- it is the same as my understanding.
As suggested, I have created a new thread for any further discussion
related to this 'disable_on_error' topic [1]/messages/by-id/CAHut+PuEsekA3e7ThwzWr+Us4x=LzkF7DSrED1UsZTUqNrhCUQ@mail.gmail.com.
======
[1]: /messages/by-id/CAHut+PuEsekA3e7ThwzWr+Us4x=LzkF7DSrED1UsZTUqNrhCUQ@mail.gmail.com
Kind Regards,
Peter Smith.
Fujitsu Australia
On Fri, Jan 12, 2024 at 7:06 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
Hi,
Thanks for the patch.
Due to this behavior, it is not possible to add a test to show the
error message as it is done for CREATE SUBSCRIPTION.
Let me know if you think there is another way to add this test.I believe it can be done with TAP tests, see for instance:
contrib/auto_explain/t/001_auto_explain.pl
However I wouldn't insist on including the test in scope of this
particular patch. Such a test doesn't currently exist, it can be added
as a separate patch, and whether this is actually a useful test (all
the tests consume time after all...) is somewhat debatable. Personally
I agree that it would be nice to have though.
Thank you for providing this information. Yes, I can write a TAP test
to check the log for the same error message.
I'll attempt it and perform a time analysis. I'm unsure where to
appropriately add this test. Any suggestions?
Following your suggestion, I won't include the test in the scope of
this patch. Instead, I'll start a new thread once I have sufficient
information required.
--
Thanks,
Nisha
AFAIK some recent commits patches (e,g [1]https://github.com/postgres/postgres/commit/776621a5e4796fa214b6b29a7ca134f6c138572a for the "slot sync"
development) have created some more cases of "could not connect..."
messages. So, you might need to enhance your patch to deal with any
new ones in the latest HEAD.
======
[1]: https://github.com/postgres/postgres/commit/776621a5e4796fa214b6b29a7ca134f6c138572a
Kind Regards,
Peter Smith.
Fujitsu Australia
AFAIK some recent commits patches (e,g [1] for the "slot sync"
development) have created some more cases of "could not connect..."
messages. So, you might need to enhance your patch to deal with any
new ones in the latest HEAD.======
[1]
https://github.com/postgres/postgres/commit/776621a5e4796fa214b6b29a7ca134f6c138572aThank you for the update.
The v3 patch has the changes needed as per the latest HEAD.
--
Thanks,
Nisha
Attachments:
v3-0001-Improve-the-connection-failure-error-messages.patchapplication/octet-stream; name=v3-0001-Improve-the-connection-failure-error-messages.patchDownload
From a8110d71939ccdd81e3de0d41eb07aaf40d98864 Mon Sep 17 00:00:00 2001
From: Nisha Moond <nisha.moond412@gmail.com>
Date: Wed, 10 Jan 2024 12:04:18 +0530
Subject: [PATCH v3] Improve the connection failure error messages
Enhances the error messages for walrcv_connect() connection failures,
offering more details on the processes that failed to establish a connection
---
src/backend/commands/subscriptioncmds.c | 6 +++---
src/backend/replication/logical/tablesync.c | 3 ++-
src/backend/replication/logical/worker.c | 3 ++-
src/backend/replication/walreceiver.c | 9 +++++----
src/test/regress/expected/subscription.out | 2 +-
5 files changed, 13 insertions(+), 10 deletions(-)
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index b647a81fc8..14e5260aef 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -764,7 +764,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("subscription \"%s\" could not connect to the publisher: %s", stmt->subname, err)));
PG_TRY();
{
@@ -915,7 +915,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("subscription \"%s\" could not connect to the publisher: %s", sub->name, err)));
PG_TRY();
{
@@ -1542,7 +1542,7 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("subscription \"%s\" could not connect to the publisher: %s", sub->name, err)));
PG_TRY();
{
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 5acab3f3e2..60f4cccb94 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1335,7 +1335,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("logical replication table synchronization worker for subscription \"%s\" could not connect to the publisher: %s",
+ MySubscription->name, err)));
Assert(MyLogicalRepWorker->relstate == SUBREL_STATE_INIT ||
MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC ||
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 32ff4c0336..d1428031ee 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4525,7 +4525,8 @@ run_apply_worker()
if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("logical replication apply worker for subscription \"%s\" could not connect to the publisher: %s",
+ MySubscription->name, err)));
/*
* We don't really use the output identify_system for anything but it does
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e29a6196a3..b45ed2f5b3 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -198,6 +198,7 @@ WalReceiverMain(void)
char *err;
char *sender_host = NULL;
int sender_port = 0;
+ char *appname = NULL;
/*
* WalRcv should be set up already (if we are a backend, we inherit this
@@ -295,14 +296,14 @@ WalReceiverMain(void)
/* Unblock signals (they were blocked when the postmaster forked us) */
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+ appname = cluster_name[0] ? cluster_name : "walreceiver";
+
/* Establish the connection to the primary for XLOG streaming */
- wrconn = walrcv_connect(conninfo, false, false,
- cluster_name[0] ? cluster_name : "walreceiver",
- &err);
+ wrconn = walrcv_connect(conninfo, false, false, appname, &err);
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err)));
+ errmsg("\"%s\" could not connect to the primary server: %s", appname, err)));
/*
* Save user-visible connection string. This clobbers the original
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 1eee6b17b8..5c9f8e47de 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -141,7 +141,7 @@ ERROR: invalid connection string syntax: invalid connection option "i_dont_exis
-- fail, connection string parses, but doesn't work (and does so without
-- connecting, so this is reliable and safe)
CREATE SUBSCRIPTION regress_testsub5 CONNECTION 'port=-1' PUBLICATION testpub;
-ERROR: could not connect to the publisher: invalid port number: "-1"
+ERROR: subscription "regress_testsub5" could not connect to the publisher: invalid port number: "-1"
-- fail - invalid connection string during ALTER
ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
ERROR: invalid connection string syntax: missing "=" after "foobar" in connection info string
--
2.34.1
On Wed, Jan 31, 2024 at 9:58 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
AFAIK some recent commits patches (e,g [1] for the "slot sync"
development) have created some more cases of "could not connect..."
messages. So, you might need to enhance your patch to deal with any
new ones in the latest HEAD.======
[1] https://github.com/postgres/postgres/commit/776621a5e4796fa214b6b29a7ca134f6c138572aThank you for the update.
The v3 patch has the changes needed as per the latest HEAD.
Hi, just going by visual inspection of the v2/v3 patch diffs, the
latest v3 LGTM.
======
Kind Regards,
Peter Smith.
Fujitsu Australia
FYI -- some more code has been pushed since this patch was last
updated. AFAICT perhaps you'll want to update this patch again for the
following new connection messages on HEAD:
- slotfuncs.c [1]https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/slotfuncs.c#L989
- slotsync.c [2]https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/logical/slotsync.c#L1258
----------
[1]: https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/slotfuncs.c#L989
[2]: https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/logical/slotsync.c#L1258
Kind Regards,
Peter Smith.
Fujitsu Australia
On Wed, Mar 13, 2024 at 11:16 AM Peter Smith <smithpb2250@gmail.com> wrote:
FYI -- some more code has been pushed since this patch was last
updated. AFAICT perhaps you'll want to update this patch again for the
following new connection messages on HEAD:- slotfuncs.c [1]
- slotsync.c [2]----------
[1] https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/slotfuncs.c#L989
[2] https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/logical/slotsync.c#L1258
Thanks for the update.
Here is the v4 patch with changes required in slotfuncs.c and slotsync.c files.
--
Thanks,
Nisha
Attachments:
v4-0001-Improve-the-connection-failure-error-messages.patchapplication/octet-stream; name=v4-0001-Improve-the-connection-failure-error-messages.patchDownload
From 576d7ae83aa4700496919303c1b731a4f612ca60 Mon Sep 17 00:00:00 2001
From: Nisha Moond <nisha.moond412@gmail.com>
Date: Wed, 10 Jan 2024 12:04:18 +0530
Subject: [PATCH v4] Improve the connection failure error messages
Enhances the error messages for walrcv_connect() connection failures,
offering more details on the processes that failed to establish a connection
---
src/backend/commands/subscriptioncmds.c | 6 +++---
src/backend/replication/logical/slotsync.c | 2 +-
src/backend/replication/logical/tablesync.c | 3 ++-
src/backend/replication/logical/worker.c | 3 ++-
src/backend/replication/slotfuncs.c | 2 +-
src/backend/replication/walreceiver.c | 10 ++++++----
src/test/regress/expected/subscription.out | 2 +-
7 files changed, 16 insertions(+), 12 deletions(-)
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 5a47fa984d..f40c16e0d7 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -762,7 +762,7 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("subscription \"%s\" could not connect to the publisher: %s", stmt->subname, err)));
PG_TRY();
{
@@ -910,7 +910,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("subscription \"%s\" could not connect to the publisher: %s", sub->name, err)));
PG_TRY();
{
@@ -1537,7 +1537,7 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("subscription \"%s\" could not connect to the publisher: %s", sub->name, err)));
PG_TRY();
{
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 7b180bdb5c..ae544e8e81 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -1259,7 +1259,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len)
if (!wrconn)
ereport(ERROR,
errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err));
+ errmsg("\"%s\" could not connect to the primary server: %s", app_name.data, err));
/*
* Register the failure callback once we have the connection.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 1061d5b61b..341e905e72 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1335,7 +1335,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("logical replication table synchronization worker for subscription \"%s\" could not connect to the publisher: %s",
+ MySubscription->name, err)));
Assert(MyLogicalRepWorker->relstate == SUBREL_STATE_INIT ||
MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC ||
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b5a80fe3e8..85609bbd86 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4500,7 +4500,8 @@ run_apply_worker()
if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("logical replication apply worker for subscription \"%s\" could not connect to the publisher: %s",
+ MySubscription->name, err)));
/*
* We don't really use the output identify_system for anything but it does
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index ad79e1fccd..0b3714b59d 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -993,7 +993,7 @@ pg_sync_replication_slots(PG_FUNCTION_ARGS)
if (!wrconn)
ereport(ERROR,
errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err));
+ errmsg("\"%s\" could not connect to the primary server: %s", app_name.data, err));
SyncReplicationSlots(wrconn);
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index acda5f68d9..2c1329b867 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -195,6 +195,7 @@ WalReceiverMain(char *startup_data, size_t startup_data_len)
char *err;
char *sender_host = NULL;
int sender_port = 0;
+ char *appname = NULL;
Assert(startup_data_len == 0);
@@ -297,14 +298,15 @@ WalReceiverMain(char *startup_data, size_t startup_data_len)
/* Unblock signals (they were blocked when the postmaster forked us) */
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+ appname = cluster_name[0] ? cluster_name : "walreceiver";
+
/* Establish the connection to the primary for XLOG streaming */
- wrconn = walrcv_connect(conninfo, true, false, false,
- cluster_name[0] ? cluster_name : "walreceiver",
- &err);
+ wrconn = walrcv_connect(conninfo, true, false, false, appname, &err);
+
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err)));
+ errmsg("\"%s\" could not connect to the primary server: %s", appname, err)));
/*
* Save user-visible connection string. This clobbers the original
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 1eee6b17b8..5c9f8e47de 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -141,7 +141,7 @@ ERROR: invalid connection string syntax: invalid connection option "i_dont_exis
-- fail, connection string parses, but doesn't work (and does so without
-- connecting, so this is reliable and safe)
CREATE SUBSCRIPTION regress_testsub5 CONNECTION 'port=-1' PUBLICATION testpub;
-ERROR: could not connect to the publisher: invalid port number: "-1"
+ERROR: subscription "regress_testsub5" could not connect to the publisher: invalid port number: "-1"
-- fail - invalid connection string during ALTER
ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
ERROR: invalid connection string syntax: missing "=" after "foobar" in connection info string
--
2.34.1
Hi, just by visual inspection of the v3/v4 patch diffs, the latest v4 LGTM.
======
Kind Regards,
Peter Smith.
Fujitsu Australia
On 22 Mar 2024, at 11:42, Nisha Moond <nisha.moond412@gmail.com> wrote:
Here is the v4 patch with changes required in slotfuncs.c and slotsync.c files.
- errmsg("could not connect to the primary server: %s", err));
+ errmsg("\"%s\" could not connect to the primary server: %s", app_name.data, err));
Messages like this should perhaps have translator comments to indicate what the
leading "%s" will contain?
--
Daniel Gustafsson
On Fri, Apr 26, 2024 at 1:10 PM Daniel Gustafsson <daniel@yesql.se> wrote:
On 22 Mar 2024, at 11:42, Nisha Moond <nisha.moond412@gmail.com> wrote:
Here is the v4 patch with changes required in slotfuncs.c and slotsync.c files.
- errmsg("could not connect to the primary server: %s", err)); + errmsg("\"%s\" could not connect to the primary server: %s", app_name.data, err));Messages like this should perhaps have translator comments to indicate what the
leading "%s" will contain?
Attached v5 patch with the translator comments as suggested.
--
Thanks,
Nisha
Attachments:
v5-0001-Improve-the-connection-failure-error-messages.patchapplication/octet-stream; name=v5-0001-Improve-the-connection-failure-error-messages.patchDownload
From 5a384888477dcf4125459e46da58c419a17e1715 Mon Sep 17 00:00:00 2001
From: Nisha Moond <nisha.moond412@gmail.com>
Date: Fri, 31 May 2024 17:52:04 +0530
Subject: [PATCH v5] Improve the connection failure error messages
Enhances the error messages for walrcv_connect() connection failures,
offering more details on the processes that failed to establish a connection.
---
src/backend/commands/subscriptioncmds.c | 21 ++++++++++++++++++---
src/backend/replication/logical/slotsync.c | 7 ++++++-
src/backend/replication/logical/tablesync.c | 8 +++++++-
src/backend/replication/logical/worker.c | 8 +++++++-
src/backend/replication/slotfuncs.c | 7 ++++++-
src/backend/replication/walreceiver.c | 11 +++++++----
src/test/regress/expected/subscription.out | 2 +-
7 files changed, 52 insertions(+), 12 deletions(-)
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index e407428dbc..23db66d82b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -755,7 +755,12 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+
+ /*
+ * translator: first %s is the subscription name, second %s is the
+ * error
+ */
+ errmsg("subscription \"%s\" could not connect to the publisher: %s", stmt->subname, err)));
PG_TRY();
{
@@ -888,7 +893,12 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+
+ /*
+ * translator: first %s is the subscription name, second %s is the
+ * error
+ */
+ errmsg("subscription \"%s\" could not connect to the publisher: %s", sub->name, err)));
PG_TRY();
{
@@ -1521,7 +1531,12 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+
+ /*
+ * translator: first %s is the subscription name, second %s is the
+ * error
+ */
+ errmsg("subscription \"%s\" could not connect to the publisher: %s", sub->name, err)));
PG_TRY();
{
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 56d3fb5d0e..632339a5fc 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -1463,7 +1463,12 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len)
if (!wrconn)
ereport(ERROR,
errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err));
+
+ /*
+ * translator: first %s is the slotsync worker name, second %s is the
+ * error
+ */
+ errmsg("\"%s\" could not connect to the primary server: %s", app_name.data, err));
/*
* Register the disconnection callback.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index b00267f042..da927e443d 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1342,7 +1342,13 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+
+ /*
+ * translator: first %s is the subscription name, second %s is the
+ * error
+ */
+ errmsg("logical replication table synchronization worker for subscription \"%s\" could not connect to the publisher: %s",
+ MySubscription->name, err)));
Assert(MyLogicalRepWorker->relstate == SUBREL_STATE_INIT ||
MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC ||
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b5a80fe3e8..2100c78c1e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4500,7 +4500,13 @@ run_apply_worker()
if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+
+ /*
+ * translator: first %s is the subscription name, second %s is the
+ * error
+ */
+ errmsg("logical replication apply worker for subscription \"%s\" could not connect to the publisher: %s",
+ MySubscription->name, err)));
/*
* We don't really use the output identify_system for anything but it does
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index dd6c1d5a7e..30a222c497 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -897,7 +897,12 @@ pg_sync_replication_slots(PG_FUNCTION_ARGS)
if (!wrconn)
ereport(ERROR,
errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err));
+
+ /*
+ * translator: first %s is the slotsync worker name, second %s is the
+ * error
+ */
+ errmsg("\"%s\" could not connect to the primary server: %s", app_name.data, err));
SyncReplicationSlots(wrconn);
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index acda5f68d9..34ee9ecb89 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -195,6 +195,7 @@ WalReceiverMain(char *startup_data, size_t startup_data_len)
char *err;
char *sender_host = NULL;
int sender_port = 0;
+ char *appname = NULL;
Assert(startup_data_len == 0);
@@ -297,14 +298,16 @@ WalReceiverMain(char *startup_data, size_t startup_data_len)
/* Unblock signals (they were blocked when the postmaster forked us) */
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+ appname = cluster_name[0] ? cluster_name : "walreceiver";
+
/* Establish the connection to the primary for XLOG streaming */
- wrconn = walrcv_connect(conninfo, true, false, false,
- cluster_name[0] ? cluster_name : "walreceiver",
- &err);
+ wrconn = walrcv_connect(conninfo, true, false, false, appname, &err);
+
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err)));
+ /* translator: first %s is the walreceiver, second %s is the error */
+ errmsg("\"%s\" could not connect to the primary server: %s", appname, err)));
/*
* Save user-visible connection string. This clobbers the original
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 0f2a25cdc1..5c2f1ee517 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -139,7 +139,7 @@ ERROR: invalid connection string syntax: invalid connection option "i_dont_exis
-- fail, connection string parses, but doesn't work (and does so without
-- connecting, so this is reliable and safe)
CREATE SUBSCRIPTION regress_testsub5 CONNECTION 'port=-1' PUBLICATION testpub;
-ERROR: could not connect to the publisher: invalid port number: "-1"
+ERROR: subscription "regress_testsub5" could not connect to the publisher: invalid port number: "-1"
-- fail - invalid connection string during ALTER
ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
ERROR: invalid connection string syntax: missing "=" after "foobar" in connection info string
--
2.34.1
Hi, just by visual inspection of the v4/v5 patch diffs, the latest v5 LGTM.
======
Kind Regards,
Peter Smith.
Fujitsu Australia
FYI - I created a CF entry [1]https://commitfest.postgresql.org/48/5075/ for this because AFAIK the patch is
just waiting for a committer to check if it is OK to be pushed, but
maybe nobody has noticed it.
======
[1]: https://commitfest.postgresql.org/48/5075/
Kind Regards,
Peter Smith.
Fujitsu Australia
On 26 Jun 2024, at 05:46, Peter Smith <smithpb2250@gmail.com> wrote:
FYI - I created a CF entry [1] for this because AFAIK the patch is
just waiting for a committer to check if it is OK to be pushed, but
maybe nobody has noticed it.
Thanks, always good to have a CF entry to track it with. It’s not forgotten though, just waiting for the tree to be branched. Right now focus is on stabilizing what will become v17.
./daniel
Nisha Moond <nisha.moond412@gmail.com> writes:
Attached v5 patch with the translator comments as suggested.
I looked at this, and I agree with the goal, but I find just about all
of the translator comments unnecessary. The ones that are useful are
useful only because the message is violating one of our message style
guidelines [1]https://www.postgresql.org/docs/devel/error-style-guide.html#ERROR-STYLE-GUIDE-OBJECT-TYPE:
When citing the name of an object, state what kind of object it is.
Rationale: Otherwise no one will know what “foo.bar.baz” refers to.
So, for example, where you have
+
+ /*
+ * translator: first %s is the subscription name, second %s is the
+ * error
+ */
+ errmsg("subscription \"%s\" could not connect to the publisher: %s", stmt->subname, err)));
I don't find that that translator comment is adding anything.
But there are a couple of places like
+ /*
+ * translator: first %s is the slotsync worker name, second %s is the
+ * error
+ */
+ errmsg("\"%s\" could not connect to the primary server: %s", app_name.data, err));
I think that the right cure for the ambiguity here is not to add a
translator comment, but to label the name properly, perhaps like
errmsg("synchronization worker \"%s\" could not connect to the primary server: %s",
app_name.data, err));
regards, tom lane
[1]: https://www.postgresql.org/docs/devel/error-style-guide.html#ERROR-STYLE-GUIDE-OBJECT-TYPE
On Tue, Jul 9, 2024 at 1:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Nisha Moond <nisha.moond412@gmail.com> writes:
Attached v5 patch with the translator comments as suggested.
I looked at this, and I agree with the goal, but I find just about all
of the translator comments unnecessary. The ones that are useful are
useful only because the message is violating one of our message style
guidelines [1]:When citing the name of an object, state what kind of object it is.
Rationale: Otherwise no one will know what “foo.bar.baz” refers to.
So, for example, where you have
+ + /* + * translator: first %s is the subscription name, second %s is the + * error + */ + errmsg("subscription \"%s\" could not connect to the publisher: %s", stmt->subname, err)));I don't find that that translator comment is adding anything.
But there are a couple of places like+ /* + * translator: first %s is the slotsync worker name, second %s is the + * error + */ + errmsg("\"%s\" could not connect to the primary server: %s", app_name.data, err));I think that the right cure for the ambiguity here is not to add a
translator comment, but to label the name properly, perhaps likeerrmsg("synchronization worker \"%s\" could not connect to the primary server: %s",
app_name.data, err));regards, tom lane
[1] https://www.postgresql.org/docs/devel/error-style-guide.html#ERROR-STYLE-GUIDE-OBJECT-TYPE
Thank you for the review.
Attached the patch v6 with suggested improvements.
- Removed unnecessary translator comments.
- Added appropriate identifier names where missing.
--
Thanks,
Nisha
Attachments:
v6-0001-Improve-the-connection-failure-error-messages.patchapplication/octet-stream; name=v6-0001-Improve-the-connection-failure-error-messages.patchDownload
From 3f260ed2e617c3f27f5a2176e8a172c79186c6de Mon Sep 17 00:00:00 2001
From: Nisha Moond <nisha.moond412@gmail.com>
Date: Fri, 31 May 2024 17:52:04 +0530
Subject: [PATCH v6] Improve the connection failure error messages
Enhances the error messages for walrcv_connect() connection failures,
offering more details on the processes that failed to establish a connection.
---
src/backend/commands/subscriptioncmds.c | 9 ++++++---
src/backend/replication/logical/slotsync.c | 3 ++-
src/backend/replication/logical/tablesync.c | 3 ++-
src/backend/replication/logical/worker.c | 3 ++-
src/backend/replication/slotfuncs.c | 3 ++-
src/backend/replication/walreceiver.c | 10 ++++++----
src/test/regress/expected/subscription.out | 2 +-
7 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index e407428dbc..16d83b3253 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -755,7 +755,8 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("subscription \"%s\" could not connect to the publisher: %s",
+ stmt->subname, err)));
PG_TRY();
{
@@ -888,7 +889,8 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("subscription \"%s\" could not connect to the publisher: %s",
+ sub->name, err)));
PG_TRY();
{
@@ -1521,7 +1523,8 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("subscription \"%s\" could not connect to the publisher: %s",
+ sub->name, err)));
PG_TRY();
{
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index fe2f07cf44..2d1914ce08 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -1463,7 +1463,8 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len)
if (!wrconn)
ereport(ERROR,
errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err));
+ errmsg("synchronization worker \"%s\" could not connect to the primary server: %s",
+ app_name.data, err));
/*
* Register the disconnection callback.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index b00267f042..7bd0f9d7e0 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1342,7 +1342,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("logical replication table synchronization worker for subscription \"%s\" could not connect to the publisher: %s",
+ MySubscription->name, err)));
Assert(MyLogicalRepWorker->relstate == SUBREL_STATE_INIT ||
MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC ||
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3b285894db..4d2f3d63d1 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -4500,7 +4500,8 @@ run_apply_worker()
if (LogRepWorkerWalRcvConn == NULL)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the publisher: %s", err)));
+ errmsg("logical replication apply worker for subscription \"%s\" could not connect to the publisher: %s",
+ MySubscription->name, err)));
/*
* We don't really use the output identify_system for anything but it does
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 38595b3a47..c7bfbb15e0 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -898,7 +898,8 @@ pg_sync_replication_slots(PG_FUNCTION_ARGS)
if (!wrconn)
ereport(ERROR,
errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err));
+ errmsg("synchronization worker \"%s\" could not connect to the primary server: %s",
+ app_name.data, err));
SyncReplicationSlots(wrconn);
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index acda5f68d9..7506774f75 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -195,6 +195,7 @@ WalReceiverMain(char *startup_data, size_t startup_data_len)
char *err;
char *sender_host = NULL;
int sender_port = 0;
+ char *appname = NULL;
Assert(startup_data_len == 0);
@@ -297,14 +298,15 @@ WalReceiverMain(char *startup_data, size_t startup_data_len)
/* Unblock signals (they were blocked when the postmaster forked us) */
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+ appname = cluster_name[0] ? cluster_name : "walreceiver";
+
/* Establish the connection to the primary for XLOG streaming */
- wrconn = walrcv_connect(conninfo, true, false, false,
- cluster_name[0] ? cluster_name : "walreceiver",
- &err);
+ wrconn = walrcv_connect(conninfo, true, false, false, appname, &err);
+
if (!wrconn)
ereport(ERROR,
(errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("could not connect to the primary server: %s", err)));
+ errmsg("streaming replication \"%s\" could not connect to the primary server: %s", appname, err)));
/*
* Save user-visible connection string. This clobbers the original
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index 0f2a25cdc1..5c2f1ee517 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -139,7 +139,7 @@ ERROR: invalid connection string syntax: invalid connection option "i_dont_exis
-- fail, connection string parses, but doesn't work (and does so without
-- connecting, so this is reliable and safe)
CREATE SUBSCRIPTION regress_testsub5 CONNECTION 'port=-1' PUBLICATION testpub;
-ERROR: could not connect to the publisher: invalid port number: "-1"
+ERROR: subscription "regress_testsub5" could not connect to the publisher: invalid port number: "-1"
-- fail - invalid connection string during ALTER
ALTER SUBSCRIPTION regress_testsub CONNECTION 'foobar';
ERROR: invalid connection string syntax: missing "=" after "foobar" in connection info string
--
2.34.1
Nisha Moond <nisha.moond412@gmail.com> writes:
Attached the patch v6 with suggested improvements.
- Removed unnecessary translator comments.
- Added appropriate identifier names where missing.
I think this is generally OK, although one could argue that it's
violating our message style guideline that primary error messages
should be short [1]https://www.postgresql.org/docs/devel/error-style-guide.html#ERROR-STYLE-GUIDE-WHAT-GOES-WHERE. The text itself isn't that bad, but once you
tack on a libpq connection failure message it's hard to claim that
the result "fits on one line".
Another way we could address this that'd reduce that problem is to
leave the primary messages as-is and add an errdetail or errcontext
message to show what's throwing the error. However, I'm not convinced
that's better. The main argument against it is that detail/context
lines can get lost, eg if you're running the server in terse logging
mode.
On balance I think it's OK, so I pushed it. I did take out a couple
of uses of "logical replication" that seemed unnecessary and were
making the length problem worse.
regards, tom lane
[1]: https://www.postgresql.org/docs/devel/error-style-guide.html#ERROR-STYLE-GUIDE-WHAT-GOES-WHERE