Bug in ALTER SUBSCRIPTION ... SERVER / ... CONNECTION with broken old server
Hi,
I tested the v19 new feature CREATE SUBSCRIPTION ... SERVER yesterday, and found an issue: once the old server becomes broken, the subscription cannot be recovered by switching it to a good server.
Here is a repro:
```
evantest=# CREATE EXTENSION postgres_fdw;
CREATE EXTENSION
# Create two servers
evantest=# CREATE SERVER old_srv FOREIGN DATA WRAPPER postgres_fdw
evantest-# OPTIONS (host '127.0.0.1', dbname 'postgres', port '5432');
CREATE SERVER
evantest=# CREATE SERVER new_srv FOREIGN DATA WRAPPER postgres_fdw
evantest-# OPTIONS (host '127.0.0.1', dbname 'postgres', port '5432');
CREATE SERVER
evantest=# CREATE USER MAPPING FOR CURRENT_USER SERVER old_srv
evantest-# OPTIONS (user 'dummy', password 'dummy');
CREATE USER MAPPING
evantest=# CREATE USER MAPPING FOR CURRENT_USER SERVER new_srv
evantest-# OPTIONS (user 'dummy', password 'dummy');
CREATE USER MAPPING
# Add old_server to a subscription
evantest=# CREATE SUBSCRIPTION sub_bug SERVER old_srv
evantest-# PUBLICATION pub_does_not_matter
evantest-# WITH (connect = false, slot_name = NONE);
WARNING: subscription was created, but is not connected
HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and alter the subscription to refresh publications.
CREATE SUBSCRIPTION
# Break old_srv
evantest=# DROP USER MAPPING FOR CURRENT_USER SERVER old_srv;
DROP USER MAPPING
# Fail to switch to a good server because old_srv is broken
evantest=# ALTER SUBSCRIPTION sub_bug SERVER new_srv;
ERROR: user mapping not found for user "chaol", server "old_srv"
evantest=# ALTER SUBSCRIPTION sub_bug CONNECTION 'host=127.0.0.1 dbname=postgres port=5432 user=dummy password=dummy';
ERROR: user mapping not found for user "chaol", server "old_srv"
```
In AlterSubscription(), this comment made me think this is a bug:
```
/*
* Skip ACL checks on the subscription's foreign server, if any. If
* changing the server (or replacing it with a raw connection), then the
* old one will be removed anyway. If changing something unrelated,
* there's no need to do an additional ACL check here; that will be done
* by the subscription worker anyway.
*/
sub = GetSubscription(subid, false, false);
```
The comment explicitly says to skip ACL checks on the old server because it will be removed anyway.
But after looking into GetSubscription(), I found that when the aclcheck parameter is false, it still calls ForeignServerConnectionString(). I think that is the root cause of the bug.
To fix this, I worked out a solution that stores the server OID in Subscription, and only calls ForeignServerConnectionString()lazily when sub->conninfo is actually needed. I also added a new test case to cover this scenario. Without the fix, the new test fails.
See attached patch for details.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Attachments:
v1-0001-Allow-altering-subscription-server-connection-wit.patchapplication/octet-stream; name=v1-0001-Allow-altering-subscription-server-connection-wit.patch; x-unix-mode=0644Download+106-14
Dear Chao,
I tested the v19 new feature CREATE SUBSCRIPTION ... SERVER yesterday, and
found an issue: once the old server becomes broken, the subscription cannot be
recovered by switching it to a good server.
Thanks for testing. I could reproduce the same issue. In addition to yours, I found
DROP SUBSCRIPTION cannot be done anymore. To switch the connection or drop it,
I had to create the same user mapping must be created again.
```
postgres=# DROP SUBSCRIPTION sub_bug ;
ERROR: user mapping not found for user "postgres", server "old_srv"
postgres=# CREATE USER MAPPING FOR CURRENT_USER SERVER old_srv
OPTIONS (user 'dummy', password 'dummy');
CREATE USER MAPPING
postgres=# DROP SUBSCRIPTION sub_bug ;
DROP SUBSCRIPTION
```
Before deep dive to your fix, I'm unclear why dropping the active USER MAPPING is
allowed. Personally, it should be avoided anyway. Do you know why it's not restricted?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Apr 22, 2026, at 20:35, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
Dear Chao,
I tested the v19 new feature CREATE SUBSCRIPTION ... SERVER yesterday, and
found an issue: once the old server becomes broken, the subscription cannot be
recovered by switching it to a good server.Thanks for testing. I could reproduce the same issue. In addition to yours, I found
DROP SUBSCRIPTION cannot be done anymore. To switch the connection or drop it,
I had to create the same user mapping must be created again.```
postgres=# DROP SUBSCRIPTION sub_bug ;
ERROR: user mapping not found for user "postgres", server "old_srv"
postgres=# CREATE USER MAPPING FOR CURRENT_USER SERVER old_srv
OPTIONS (user 'dummy', password 'dummy');
CREATE USER MAPPING
postgres=# DROP SUBSCRIPTION sub_bug ;
DROP SUBSCRIPTION
```Before deep dive to your fix, I'm unclear why dropping the active USER MAPPING is
allowed. Personally, it should be avoided anyway. Do you know why it's not restricted?Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi Hayato-san,
There is an existing test case in subscription.sql:
```
-- fail, must connect but lacks USAGE on server, as well as user mapping
DROP SUBSCRIPTION regress_testsub6;
```
So, I guess that’s an intentional behavior. You have to fix the broken server or switch to a good one before dropping the subscription. That’s my understanding from the test cases.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Wed, Apr 22, 2026 at 11:52 AM Chao Li <li.evan.chao@gmail.com> wrote:
Hi,
The comment explicitly says to skip ACL checks on the old server because it will be removed anyway.
But after looking into GetSubscription(), I found that when the aclcheck parameter is false, it still calls ForeignServerConnectionString(). I think that is the root cause of the bug.
To fix this, I worked out a solution that stores the server OID in Subscription, and only calls ForeignServerConnectionString()lazily when sub->conninfo is actually needed. I also added a new test case to cover this scenario. Without the fix, the new test fails.
See attached patch for details.
Hi Li,
Thanks for the patch.
Some comments:
1.
if (aclresult != ACLCHECK_OK)
ereport(ERROR,
- (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("subscription owner \"%s\" does not
have permission on foreign server \"%s\"",
- GetUserNameFromId(subform->subowner, false),
- server->servername)));
+ errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("subscription owner \"%s\" does not
have permission on foreign server \"%s\"",
+ GetUserNameFromId(subform->subowner, false),
+ server->servername));
+ sub->conninfo = ForeignServerConnectionString(subform->subowner,
+ server);
}
Add a new line before the call to ForeignServerConnectionString(),
also I think you should put the if condition within curly brackets
because it spans more than one line and might confuse developers while
adding new code.
2. I think you should add a comment in the function header above
GetSubscription() stating that if aclcheck is false, then the conninfo
will be set to null and users need to call GetSubscriptionConnInfo to
get the conninfo.
3.
Datum
test_fdw_connection(PG_FUNCTION_ARGS)
{
+ GetUserMapping(PG_GETARG_OID(0), PG_GETARG_OID(1));
PG_RETURN_TEXT_P(cstring_to_text("dbname=regress_doesnotexist
user=doesnotexist password=secret"));
}
Add a comment above this change describing why it's required.
regards,
Ajin Cherian
Fujitsu Australia
On Apr 29, 2026, at 12:44, Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Apr 22, 2026 at 11:52 AM Chao Li <li.evan.chao@gmail.com> wrote:
Hi,
The comment explicitly says to skip ACL checks on the old server because it will be removed anyway.
But after looking into GetSubscription(), I found that when the aclcheck parameter is false, it still calls ForeignServerConnectionString(). I think that is the root cause of the bug.
To fix this, I worked out a solution that stores the server OID in Subscription, and only calls ForeignServerConnectionString()lazily when sub->conninfo is actually needed. I also added a new test case to cover this scenario. Without the fix, the new test fails.
See attached patch for details.Hi Li,
Thanks for the patch.
Thanks for your review.
Some comments:
1. if (aclresult != ACLCHECK_OK) ereport(ERROR, - (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("subscription owner \"%s\" does not have permission on foreign server \"%s\"", - GetUserNameFromId(subform->subowner, false), - server->servername))); + errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("subscription owner \"%s\" does not have permission on foreign server \"%s\"", + GetUserNameFromId(subform->subowner, false), + server->servername)); + sub->conninfo = ForeignServerConnectionString(subform->subowner, + server); }Add a new line before the call to ForeignServerConnectionString(),
Okay.
also I think you should put the if condition within curly brackets
because it spans more than one line and might confuse developers while
adding new code.
I don’t think curly brackets are needed as there is only one statement under the if though the statement spans multiple lines. There are plenty of examples in the existing code, for example:
```
if (!wrconn)
ereport(ERROR,
errcode(ERRCODE_CONNECTION_FAILURE),
errmsg("subscription \"%s\" could not connect to the publisher: %s",
sub->name, err));
```
2. I think you should add a comment in the function header above
GetSubscription() stating that if aclcheck is false, then the conninfo
will be set to null and users need to call GetSubscriptionConnInfo to
get the conninfo.
Updated the header comment.
3.
Datum
test_fdw_connection(PG_FUNCTION_ARGS)
{
+ GetUserMapping(PG_GETARG_OID(0), PG_GETARG_OID(1));
PG_RETURN_TEXT_P(cstring_to_text("dbname=regress_doesnotexist
user=doesnotexist password=secret"));
}Add a comment above this change describing why it's required.
Added the comment.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Attachments:
v2-0001-Allow-altering-subscription-server-connection-wit.patchapplication/octet-stream; name=v2-0001-Allow-altering-subscription-server-connection-wit.patch; x-unix-mode=0644Download+116-14
On Thu, Apr 30, 2026 at 2:12 PM Chao Li <li.evan.chao@gmail.com> wrote:
2. I think you should add a comment in the function header above
GetSubscription() stating that if aclcheck is false, then the conninfo
will be set to null and users need to call GetSubscriptionConnInfo to
get the conninfo.Updated the header comment.
/*
* Fetch the subscription from the syscache.
+ *
+ * If missing_ok is false, throw an error if the subscription is not found.
+ * If true, return NULL in that case.
+ *
+ * If aclcheck is true, check whether the subscription owner has permission on
+ * the subscription's foreign server, and load the connection string from the
+ * foreign server. Later, GetSubscriptionConnInfo() should be called to get
+ * the connection string.
*/
Subscription *
GetSubscription(Oid subid, bool missing_ok, bool aclcheck)
I don't think this comment is right. If aclcheck is true, users need
not call GetSubscriptionConnInfo() to get the connection string, as it
is populated in this function itself. It is only if aclecheck is
false, do callers need to do that.
Isn't that the case? There are multiple places in the apply worker
where GetSubscription() is called with aclcheck is true and
GetSubscriptionConnInfo() is not called there.
regards,
Ajin Cherian
Fujitsu Australia
Hello
- server = GetForeignServer(subform->subserver);
+ server = GetForeignServer(sub->serverid);
Couldn't we also move this inside the if?
+/*
+ * Return the subscription's connection string, loading it into the
+ * subscription memory context if necessary.
+ *
+ * GetSubscription must be called earlier to set sub->serverid, because ACL
+ * checks are performed there.
+ */
+char *
+GetSubscriptionConnInfo(Subscription *sub)
This is related to Ajin's comment earlier, the part about ACL check
seems incorrect to me.
On May 1, 2026, at 11:58, Ajin Cherian <itsajin@gmail.com> wrote:
On Thu, Apr 30, 2026 at 2:12 PM Chao Li <li.evan.chao@gmail.com> wrote:
2. I think you should add a comment in the function header above
GetSubscription() stating that if aclcheck is false, then the conninfo
will be set to null and users need to call GetSubscriptionConnInfo to
get the conninfo.Updated the header comment.
/* * Fetch the subscription from the syscache. + * + * If missing_ok is false, throw an error if the subscription is not found. + * If true, return NULL in that case. + * + * If aclcheck is true, check whether the subscription owner has permission on + * the subscription's foreign server, and load the connection string from the + * foreign server. Later, GetSubscriptionConnInfo() should be called to get + * the connection string. */ Subscription * GetSubscription(Oid subid, bool missing_ok, bool aclcheck)I don't think this comment is right. If aclcheck is true, users need
not call GetSubscriptionConnInfo() to get the connection string, as it
is populated in this function itself. It is only if aclecheck is
false, do callers need to do that.
Isn't that the case? There are multiple places in the apply worker
where GetSubscription() is called with aclcheck is true and
GetSubscriptionConnInfo() is not called there.regards,
Ajin Cherian
Fujitsu Australia
Hi Ajin,
Thank you for the comment. After rereading that part, I agree the wording is not clear.
What I meant is that GetSubscriptionConnInfo() is a safe accessor, if sub->conninfo has already been resolved, it just returns it; otherwise, it resolves it on demand. So the intended usage is that callers can consistently use GetSubscriptionConnInfo() when they need the connection string.
I also missed doing a broader search for direct uses of sub->conninfo and replacing them with GetSubscriptionConnInfo() where appropriate. Sorry about that.
I will address both issues in v3.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On May 6, 2026, at 04:53, Zsolt Parragi <zsolt.parragi@percona.com> wrote:
Hello
- server = GetForeignServer(subform->subserver); + server = GetForeignServer(sub->serverid);Couldn't we also move this inside the if?
Ah, true. Both aclresult and server can be moved into the if.
+/* + * Return the subscription's connection string, loading it into the + * subscription memory context if necessary. + * + * GetSubscription must be called earlier to set sub->serverid, because ACL + * checks are performed there. + */ +char * +GetSubscriptionConnInfo(Subscription *sub)This is related to Ajin's comment earlier, the part about ACL check
seems incorrect to me.
Yes, see my reply to Ajin in the previous email.
PFA v3 - addressed Ajin and Zsolt’s comments.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Attachments:
v3-0001-Allow-altering-subscription-server-connection-wit.patchapplication/octet-stream; name=v3-0001-Allow-altering-subscription-server-connection-wit.patch; x-unix-mode=0644Download+130-25
On Wed, 2026-05-06 at 15:57 +0800, Chao Li wrote:
PFA v3 - addressed Ajin and Zsolt’s comments.
Thank you for the report!
The proposed patch seems unnecessarily complex, though. It seems too
easy to add GetSubscriptionConninfo() in the wrong place and end up
with another problem that's not easily detected.
Can't we just do something like the attached? It's easy to explain at
the call site that, when changing to a different server or using
CONNECTION instead, that we don't need the old conninfo at all.
I included your test case in my patch, and it passes.
Also, Hayato Kuroda's report was an issue also because the error could
be thrown even if slotname was NULL. Patch attached for that, as well.
Thank you, also!
Regards,
Jeff Davis
Attachments:
v4-0001-Avoid-errors-during-ALTER-SUBSCRIPTION.patchtext/x-patch; charset=UTF-8; name=v4-0001-Avoid-errors-during-ALTER-SUBSCRIPTION.patchDownload+92-40
v4-0002-Avoid-errors-during-DROP-SUBSCRIPTION.patchtext/x-patch; charset=UTF-8; name=v4-0002-Avoid-errors-during-DROP-SUBSCRIPTION.patchDownload+37-35
On May 9, 2026, at 09:01, Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2026-05-06 at 15:57 +0800, Chao Li wrote:
PFA v3 - addressed Ajin and Zsolt’s comments.
Thank you for the report!
The proposed patch seems unnecessarily complex, though. It seems too
easy to add GetSubscriptionConninfo() in the wrong place and end up
with another problem that's not easily detected.Can't we just do something like the attached? It's easy to explain at
the call site that, when changing to a different server or using
CONNECTION instead, that we don't need the old conninfo at all.I included your test case in my patch, and it passes.
Also, Hayato Kuroda's report was an issue also because the error could
be thrown even if slotname was NULL. Patch attached for that, as well.
Thank you, also!Regards,
Jeff Davis<v4-0001-Avoid-errors-during-ALTER-SUBSCRIPTION.patch><v4-0002-Avoid-errors-during-DROP-SUBSCRIPTION.patch>
Ah, I see. You added a new conninfo_needed parameter to GetSubscription(), which separates the decision of building conninfo from the ACL check. Cool, I believe this is a better approach.
So 0001 looks good to me. nitpick is that conninfo_aclcheck is now only meaningful when conninfo_needed is true. I wonder if we should mention that briefly in the function header comment, or add an assertion such as: Assert(conninfo_needed || !conninfo_aclcheck); to avoid possible misuse of conninfo_aclcheck in the future.
For 0002, I have a doubt. Now conninfo is built only when slotname is not NULL. But after reading through DropSubscription(), I am not sure conninfo is strictly tied to slotname.
For example, this fast path returns only when both slotname is NULL and rstates is NIL:
```
/*
* If there is no slot associated with the subscription, we can finish
* here.
*/
if (!slotname && rstates == NIL)
{
table_close(rel, NoLock);
return;
}
```
That seems to imply that even when slotname is NULL, rstates might still be not NIL.
Later, if conninfo is not NULL, the code connects to the publisher and does some cleanup work for tablesync slots:
```
if (conninfo)
wrconn = walrcv_connect(conninfo, true, true, must_use_password,
subname, &err);
...
/*
* Drop the tablesync slots associated with removed tables.
*
* For SYNCDONE/READY states, the tablesync slot is known to have
* already been dropped by the tablesync worker.
*
* For other states, there is no certainty, maybe the slot does
* not exist yet. Also, if we fail after removing some of the
* slots, next time, it will again try to drop already dropped
* slots and fail. For these reasons, we allow missing_ok = true
* for the drop.
*/
if (rstate->state != SUBREL_STATE_SYNCDONE)
{
char syncslotname[NAMEDATALEN] = {0};
ReplicationSlotNameForTablesync(subid, relid, syncslotname,
sizeof(syncslotname));
ReplicationSlotDropAtPubNode(wrconn, syncslotname, true);
}
```
So with 0002, if slotname is NULL but rstates is not NIL, it looks possible that we no longer build conninfo and therefore skip the cleanup on the publisher side.
Best reagards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/