pg_createsubscriber: Fix incorrect handling of cleanup flags
Hi Hackers,
In pg_createsubscriber, the flags 'made_publication' and
'made_replslot' are used to track whether the tool itself created a
publication or replication slot on the primary (publisher) node. If a
failure happens during the process, these flags help the tool decide
whether it should clean up those objects using
cleanup_objects_atexit().
However, there are cases where these flags are wrongly set to false
due to failures on the subscriber side, which causes the tool to skip
cleanup of some objects on the primary even when it should not.
Example: for made_publication
In drop_publication(), if dropping a publication on the subscriber
fails (either a replicated publication or an existing one being
removed with --remove=publications), the made_publication flag is
wrongly set to false.
The process continues without exiting, but if a later step fails,
cleanup_objects_atexit() will see made_publication = false and skip
dropping the publication on the primary, even though it was created
earlier by the tool. This leads to orphaned publication.
Example: for made_replslot
A similar issue exists for replication slots. In
drop_replication_slot(), if dropping a physical replication slot on
the primary, or a failover-synced slot on the subscriber, fails — the
made_replslot flag is set to false.
Again, if the process fails later, cleanup_objects_atexit() will
incorrectly skip dropping the logical replication slot created earlier
on the primary, leaving it behind.
Solution:
The fix ensures that failures in dropping subscriber-side or
non-internal objects should not reset made_publication or
made_replslot.
These flags should only be reset if dropping the internally created
objects on the primary fails. That way, cleanup_objects_atexit() can
still correctly clean up what the tool created if something else goes
wrong later in the process.
Attached is the patch implementing the above proposed solution.
Reviews and feedback are most welcome.
--
Thanks,
Nisha
Attachments:
v1-0001-Fix-incorrect-cleanup-flag-handling-in-pg_creates.patchapplication/x-patch; name=v1-0001-Fix-incorrect-cleanup-flag-handling-in-pg_creates.patchDownload
From ec934502f2da4e7822b851c040f9b832dd90b0a5 Mon Sep 17 00:00:00 2001
From: Nisha Moond <nisha.moond412@gmail.com>
Date: Mon, 14 Apr 2025 12:10:12 +0530
Subject: [PATCH v1] Fix incorrect cleanup flag handling in pg_createsubscriber
The flags made_publication and made_replslot track whether the tool
created a publication or replication slot on the primary. These are
used during error handling to clean up internal objects on primary.
Previously, these flags were incorrectly reset to false when failures
occurred while dropping objects(publications/replication slots) on the
subscriber. As a result, upon a failure, the cleanup_objects_atexit()
skipped cleanup of these objects on the primary.
This patch fixes the issue by making sure not to set the flags to false
if the failure occurs while dropping objects on the subscriber side.
---
src/bin/pg_basebackup/pg_createsubscriber.c | 34 +++++++++++++--------
1 file changed, 22 insertions(+), 12 deletions(-)
diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index f65acc7cb11..d20c19c4ec3 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -105,7 +105,7 @@ static void drop_failover_replication_slots(struct LogicalRepInfo *dbinfo);
static char *create_logical_replication_slot(PGconn *conn,
struct LogicalRepInfo *dbinfo);
static void drop_replication_slot(PGconn *conn, struct LogicalRepInfo *dbinfo,
- const char *slot_name);
+ const char *slot_name, bool in_cleanup);
static void pg_ctl_status(const char *pg_ctl_cmd, int rc);
static void start_standby_server(const struct CreateSubscriberOptions *opt,
bool restricted_access,
@@ -115,7 +115,8 @@ static void wait_for_end_recovery(const char *conninfo,
const struct CreateSubscriberOptions *opt);
static void create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo);
static void drop_publication(PGconn *conn, const char *pubname,
- const char *dbname, bool *made_publication);
+ const char *dbname, bool *made_publication,
+ bool in_cleanup);
static void check_and_drop_publications(PGconn *conn, struct LogicalRepInfo *dbinfo);
static void create_subscription(PGconn *conn, const struct LogicalRepInfo *dbinfo);
static void set_replication_progress(PGconn *conn, const struct LogicalRepInfo *dbinfo,
@@ -204,9 +205,9 @@ cleanup_objects_atexit(void)
{
if (dbinfo->made_publication)
drop_publication(conn, dbinfo->pubname, dbinfo->dbname,
- &dbinfo->made_publication);
+ &dbinfo->made_publication, true);
if (dbinfo->made_replslot)
- drop_replication_slot(conn, dbinfo, dbinfo->replslotname);
+ drop_replication_slot(conn, dbinfo, dbinfo->replslotname, true);
disconnect_database(conn, false);
}
else
@@ -1295,7 +1296,7 @@ drop_primary_replication_slot(struct LogicalRepInfo *dbinfo, const char *slotnam
conn = connect_database(dbinfo[0].pubconninfo, false);
if (conn != NULL)
{
- drop_replication_slot(conn, &dbinfo[0], slotname);
+ drop_replication_slot(conn, &dbinfo[0], slotname, false);
disconnect_database(conn, false);
}
else
@@ -1330,7 +1331,7 @@ drop_failover_replication_slots(struct LogicalRepInfo *dbinfo)
{
/* Remove failover replication slots from subscriber */
for (int i = 0; i < PQntuples(res); i++)
- drop_replication_slot(conn, &dbinfo[0], PQgetvalue(res, i, 0));
+ drop_replication_slot(conn, &dbinfo[0], PQgetvalue(res, i, 0), false);
}
else
{
@@ -1407,7 +1408,7 @@ create_logical_replication_slot(PGconn *conn, struct LogicalRepInfo *dbinfo)
static void
drop_replication_slot(PGconn *conn, struct LogicalRepInfo *dbinfo,
- const char *slot_name)
+ const char *slot_name, bool in_cleanup)
{
PQExpBuffer str = createPQExpBuffer();
char *slot_name_esc;
@@ -1433,7 +1434,9 @@ drop_replication_slot(PGconn *conn, struct LogicalRepInfo *dbinfo,
{
pg_log_error("could not drop replication slot \"%s\" in database \"%s\": %s",
slot_name, dbinfo->dbname, PQresultErrorMessage(res));
- dbinfo->made_replslot = false; /* don't try again. */
+
+ if (in_cleanup)
+ dbinfo->made_replslot = false; /* don't try again. */
}
PQclear(res);
@@ -1675,7 +1678,7 @@ create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
*/
static void
drop_publication(PGconn *conn, const char *pubname, const char *dbname,
- bool *made_publication)
+ bool *made_publication, bool in_cleanup)
{
PQExpBuffer str = createPQExpBuffer();
PGresult *res;
@@ -1701,7 +1704,6 @@ drop_publication(PGconn *conn, const char *pubname, const char *dbname,
{
pg_log_error("could not drop publication \"%s\" in database \"%s\": %s",
pubname, dbname, PQresultErrorMessage(res));
- *made_publication = false; /* don't try again. */
/*
* Don't disconnect and exit here. This routine is used by primary
@@ -1709,7 +1711,15 @@ drop_publication(PGconn *conn, const char *pubname, const char *dbname,
* subscriber (remove the replicated publications). In both cases,
* it can continue and provide instructions for the user to remove
* it later if cleanup fails.
+ *
+ * in_cleanup is false when dropping an existing publication on
+ * the subscriber (e.g., when --remove=publications is specified).
+ * In this case, the flag 'made_publication' must remain true
+ * because the publication created on the primary still needs to
+ * be cleaned up if a failure occurs later.
*/
+ if (in_cleanup)
+ *made_publication = false; /* don't try again. */
}
PQclear(res);
}
@@ -1752,7 +1762,7 @@ check_and_drop_publications(PGconn *conn, struct LogicalRepInfo *dbinfo)
/* Drop each publication */
for (int i = 0; i < PQntuples(res); i++)
drop_publication(conn, PQgetvalue(res, i, 0), dbinfo->dbname,
- &dbinfo->made_publication);
+ &dbinfo->made_publication, false);
PQclear(res);
}
@@ -1763,7 +1773,7 @@ check_and_drop_publications(PGconn *conn, struct LogicalRepInfo *dbinfo)
*/
if (!drop_all_pubs || dry_run)
drop_publication(conn, dbinfo->pubname, dbinfo->dbname,
- &dbinfo->made_publication);
+ &dbinfo->made_publication, false);
}
/*
--
2.34.1
On Sun, May 4, 2025 at 8:45 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
Attached is the patch implementing the above proposed solution.
Reviews and feedback are most welcome.
I feel like this is just papering over the issue - which is that these two
drop functions are being used for multiple differently named
publications/slots yet take no care to ensure they only change
made_publication and made_repslot if the name of the object being passed in
matches the name of the object those two booleans are specifically tracking
(the application created objects on the publisher).
Make it so they are only changed to false if the name matches the one the
program created and the connection is the primary connection. That targets
the real issue and avoids using a branching boolean parameter.
It seems really odd to say: if (in_cleanup) "don't try again" - since by
definition this is the last thing we are doing before we exit. So really
what this patch can do more simply is just remove the
dbinfo->made_replslot=false and *made_publication=false lines altogether -
which might be a valid option.
I'm partial to the latter really, I don't think the error message output
for retrying a drop that may have previously failed would be an issue.
David J.
On Mon, May 5, 2025 at 11:39 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
On Sun, May 4, 2025 at 8:45 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
Attached is the patch implementing the above proposed solution.
Reviews and feedback are most welcome.I feel like this is just papering over the issue - which is that these two drop functions are being used for multiple differently named publications/slots yet take no care to ensure they only change made_publication and made_repslot if the name of the object being passed in matches the name of the object those two booleans are specifically tracking (the application created objects on the publisher).
Make it so they are only changed to false if the name matches the one the program created and the connection is the primary connection. That targets the real issue and avoids using a branching boolean parameter.
It seems really odd to say: if (in_cleanup) "don't try again" - since by definition this is the last thing we are doing before we exit. So really what this patch can do more simply is just remove the dbinfo->made_replslot=false and *made_publication=false lines altogether - which might be a valid option.
+1 to removing the dbinfo->made_replslot=false and
*made_publication=false lines. In my tests, I attempted to force
multiple failures, but couldn’t find any case where
cleanup_objects_atexit() would recurse or cause repeated cleanup if
these flags remain set to true.
I'm partial to the latter really, I don't think the error message output for retrying a drop that may have previously failed would be an issue.
As of now, we don’t attempt to drop the same object more than once, so
the latter approach does seem reasonable to me. That said, I’m unsure
why the flags were being reset here in the first place.
Please find the updated patch which removes the false setting of these
flags during drop. If there’s a case I’ve overlooked where this might
be problematic, we can certainly go for your first suggestion to match
the names.
--
Thanks,
Nisha
Attachments:
v2-0001-Fix-incorrect-cleanup-flag-handling-in-pg_creates.patchapplication/x-patch; name=v2-0001-Fix-incorrect-cleanup-flag-handling-in-pg_creates.patchDownload
From 6d50f40d8bd6adc693c607b46600a46fe8337a75 Mon Sep 17 00:00:00 2001
From: Nisha Moond <nisha.moond412@gmail.com>
Date: Tue, 6 May 2025 09:34:14 +0530
Subject: [PATCH v2] Fix incorrect cleanup flag handling in pg_createsubscriber
The flags made_publication and made_replslot track whether the tool
created a publication or replication slot on the primary. These are
used during error handling to clean up internal objects on primary.
Previously, these flags were incorrectly reset to false when failures
occurred while dropping objects(publications/replication slots) on the
subscriber. As a result, upon a failure, the cleanup_objects_atexit()
skipped cleanup of these objects on the primary.
This patch removes the resetting of these cleanup flags during
drop_replication_slot and drop_publication, ensuring proper cleanup of
primary objects before exit on failure.
---
src/bin/pg_basebackup/pg_createsubscriber.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index f65acc7cb11..20a8ad8ed4d 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -1433,7 +1433,6 @@ drop_replication_slot(PGconn *conn, struct LogicalRepInfo *dbinfo,
{
pg_log_error("could not drop replication slot \"%s\" in database \"%s\": %s",
slot_name, dbinfo->dbname, PQresultErrorMessage(res));
- dbinfo->made_replslot = false; /* don't try again. */
}
PQclear(res);
@@ -1701,7 +1700,6 @@ drop_publication(PGconn *conn, const char *pubname, const char *dbname,
{
pg_log_error("could not drop publication \"%s\" in database \"%s\": %s",
pubname, dbname, PQresultErrorMessage(res));
- *made_publication = false; /* don't try again. */
/*
* Don't disconnect and exit here. This routine is used by primary
--
2.34.1