Backend crash on non-exclusive backup cancel
I found this issue while working on a pg_stop_backup() patch. If a
non-exclusive pg_stop_backup() is cancelled and then attempted again the
backend will crash on assertion:
$ test/pg/bin/psql
psql (10devel)
Type "help" for help.
postgres=# select * from pg_start_backup('label', true, false);
pg_start_backup
-----------------
0/2000028
(1 row)
postgres=# select * from pg_stop_backup(false);
^CCancel request sent
ERROR: canceling statement due to user request
postgres=# select * from pg_stop_backup(false);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q
From the server log:
2017-02-28 01:21:34.755 UTC STATEMENT: select * from pg_stop_backup(false);
TRAP: FailedAssertion("!(XLogCtl->Insert.nonExclusiveBackups > 0)",
File: "/postgres/src/backend/access/transam/xlog.c", Line: 10723)
This error was produced in master at 30df93f. Configure settings are
--enable-cassert --enable-tap-tests --with-openssl.
Disabling assertions "works", but there is still a problem. A backend
that keeps cancelling pg_stop_backup() without ever resetting the
exclusive flag in xlogfunc.c can decrement the the shared variable
XLogCtl->Insert.nonExclusiveBackups as many times as it wants. As far
as I can see the worst that will happen is that
XLogCtl->Insert.forcePageWrites won't get set back to false, but that's
still a bug.
This condition should throw "backup is not in progress" just as a
exclusive backup would, whether assertions are enabled or not.
I believe the solution is to move the exclusive flag to xlog.c and only
decrement XLogCtl->Insert.nonExclusiveBackups when exclusive is true,
otherwise return an error. Even then, it wouldn't be clear if the
backup had completed or not. I suppose any cancelled non-exclusive
pg_stop_backup() should be considered aborted whether a stop backup
record was written or not?
If that makes sense I'm happy to work up a patch. This is definitely an
edge case and I seriously doubt it is causing any issues in the field.
--
-David
david@pgmasters.net
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Tue, Feb 28, 2017 at 10:33 AM, David Steele <david@pgmasters.net> wrote:
This condition should throw "backup is not in progress" just as a
exclusive backup would, whether assertions are enabled or not.I believe the solution is to move the exclusive flag to xlog.c and only
decrement XLogCtl->Insert.nonExclusiveBackups when exclusive is true,
otherwise return an error. Even then, it wouldn't be clear if the
backup had completed or not.
I understand by this sentence that you mean
nonexclusive_backup_running, and I think that I agree with that. We
need a way to reset it properly the session-level switch in case of an
interrupt found, and that needs visibly to happen in
pg_stop_backup_callback and pg_start_backup_callback(). At the same
time I think that exclusive_backup_running had better be moved to
xlog.c as well. If I look at the failure in details when issuing a
cancel, I can see that XLogCtl->Insert.nonExclusiveBackups gets
decremented at the end do_pg_stop_backup, but
nonexclusive_backup_running never gets set back to false because of
the query cancellation.
I suppose any cancelled non-exclusive
pg_stop_backup() should be considered aborted whether a stop backup
record was written or not?
That's not necessarily true, I can see a stop backup able to finish as
well by issuing a cancel request. It seems to me that we just need to
have the shmem information updated at the same time as the
session-level switches for consistency and we're good. The
inconsistency in places when updating the session-level flags and the
shmem-level flags is what is causing harm.
If that makes sense I'm happy to work up a patch. This is definitely an
edge case and I seriously doubt it is causing any issues in the field.
Well, at least it is nothing caused directly by 974ece5, I am able to
crash the server with or without that...
There is actually some code that I know of that can issue a cancel
request on pg_stop_backend(), this does not have assertions of course
but it may become an issue. That's pretty close to the improvements I
did recently for lock handling of exclusive backups, so there's
nothing really new for me :)
David, are you willing to write a patch or should I? Changing
nonexclusive_backup_running/exclusive_backup_running to be an enum
would make the code more readable as well IMO.
--
Michael
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On 2/27/17 10:05 PM, Michael Paquier wrote:
On Tue, Feb 28, 2017 at 10:33 AM, David Steele <david@pgmasters.net> wrote:
I believe the solution is to move the exclusive flag to xlog.c and only
decrement XLogCtl->Insert.nonExclusiveBackups when exclusive is true,
otherwise return an error. Even then, it wouldn't be clear if the
backup had completed or not.I understand by this sentence that you mean
nonexclusive_backup_running, and I think that I agree with that.
Whoops! Yes, I meant the nonexclusive_backup_running flag.
We
need a way to reset it properly the session-level switch in case of an
interrupt found, and that needs visibly to happen in
pg_stop_backup_callback and pg_start_backup_callback(). At the same
time I think that exclusive_backup_running had better be moved to
xlog.c as well. If I look at the failure in details when issuing a
cancel,
Agreed.
I can see that XLogCtl->Insert.nonExclusiveBackups gets
decremented at the end do_pg_stop_backup, but
nonexclusive_backup_running never gets set back to false because of
the query cancellation.
Exactly.
I suppose any cancelled non-exclusive
pg_stop_backup() should be considered aborted whether a stop backup
record was written or not?That's not necessarily true, I can see a stop backup able to finish as
well by issuing a cancel request. It seems to me that we just need to
have the shmem information updated at the same time as the
session-level switches for consistency and we're good. The
inconsistency in places when updating the session-level flags and the
shmem-level flags is what is causing harm.
I'm sure this could be done but it will require quite a bit of
refactoring and I'm not sure that it's worth it. In my mind it would be
enough to document that cancelled backups should be considered invalid
whether the stop backup record was written or not. However, I'm willing
to go with the majority opinion.
If that makes sense I'm happy to work up a patch. This is definitely an
edge case and I seriously doubt it is causing any issues in the field.<...>
David, are you willing to write a patch or should I? Changing
nonexclusive_backup_running/exclusive_backup_running to be an enum
would make the code more readable as well IMO.
I would like to see if anyone else weighs in on this first, but yes I am
planning to write a patch. Agreed on the enum.
--
-David
david@pgmasters.net
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Tue, Feb 28, 2017 at 10:21 PM, David Steele <david@pgmasters.net> wrote:
On 2/27/17 10:05 PM, Michael Paquier wrote:
That's not necessarily true, I can see a stop backup able to finish as
well by issuing a cancel request. It seems to me that we just need to
have the shmem information updated at the same time as the
session-level switches for consistency and we're good. The
inconsistency in places when updating the session-level flags and the
shmem-level flags is what is causing harm.I'm sure this could be done but it will require quite a bit of
refactoring and I'm not sure that it's worth it. In my mind it would be
enough to document that cancelled backups should be considered invalid
whether the stop backup record was written or not. However, I'm willing
to go with the majority opinion.
I would think that addressing the problem is the way to go, hitting an
assertion in this code path means that we are doing something wrong so
simply removing it does not sound like a correct answer to me.
If that makes sense I'm happy to work up a patch. This is definitely an
edge case and I seriously doubt it is causing any issues in the field.<...>
David, are you willing to write a patch or should I? Changing
nonexclusive_backup_running/exclusive_backup_running to be an enum
would make the code more readable as well IMO.I would like to see if anyone else weighs in on this first, but yes I am
planning to write a patch. Agreed on the enum.
This was itching me yesterday so I wrote a patch able to address this
problem... At least it can be registered to the CF on time and give it
more visibility. By the way, I have noticed a second bug in the
current logic while looking at the problem you have reported:
1) Start backup in session 1:
=# select pg_start_backup('popo');
pg_start_backup
-----------------
0/2000028
(1 row)
2) pg_stop_backup() in session 2
3) And now when trying to work on backups with session 1:
=# select pg_start_backup('popo');
ERROR: 55000: a backup is already in progress in this session
LOCATION: pg_start_backup, xlogfuncs.c:87
=# select pg_stop_backup();
ERROR: 55000: exclusive backup not in progress
LOCATION: do_pg_stop_backup, xlog.c:10642
So the handling around exclusive_backup_running is broken now as it is
possible to lock a session easily when handling backups. And the root
of the problem is that checks on exclusive_backup_running are not
necessary. I have fixed as well this problem as per the attached. Note
however that SESSION_BACKUP_EXCLUSIVE still makes sense in the patch
attached when doing checks with pg_stop_backup_v2() for a
non-exclusive backup run. It is also useful for debugging.
--
Michael
Attachments:
backup-session-locks-fixes.patchapplication/octet-stream; name=backup-session-locks-fixes.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 897358342d..16aa7f3be2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -503,6 +503,12 @@ typedef enum ExclusiveBackupState
} ExclusiveBackupState;
/*
+ * Session status of running backup, used for sanity checks in SQL-callable
+ * functions to start and stop backups.
+ */
+static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
+
+/*
* Shared state data for WAL insertion.
*/
typedef struct XLogCtlInsert
@@ -10503,13 +10509,17 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
/*
* Mark that start phase has correctly finished for an exclusive backup.
+ * Session-level locks are updated as well to reflect that state.
*/
if (exclusive)
{
WALInsertLockAcquireExclusive();
XLogCtl->Insert.exclusiveBackupState = EXCLUSIVE_BACKUP_IN_PROGRESS;
WALInsertLockRelease();
+ sessionBackupState = SESSION_BACKUP_EXCLUSIVE;
}
+ else
+ sessionBackupState = SESSION_BACKUP_NON_EXCLUSIVE;
/*
* We're done. As a convenience, return the starting WAL location.
@@ -10565,6 +10575,15 @@ pg_stop_backup_callback(int code, Datum arg)
}
/*
+ * Utility routine to fetch the session-level status of a backup running.
+ */
+SessionBackupState
+get_backup_status(void)
+{
+ return sessionBackupState;
+}
+
+/*
* do_pg_stop_backup is the workhorse of the user-visible pg_stop_backup()
* function.
*
@@ -10731,6 +10750,9 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
}
WALInsertLockRelease();
+ /* Clean up session-level lock */
+ sessionBackupState = SESSION_BACKUP_NONE;
+
/*
* Read and parse the START WAL LOCATION line (this code is pretty crude,
* but we are not expecting any variability in the file format).
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 27c0c561a8..68ebdd831e 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -42,8 +42,6 @@
*/
static StringInfo label_file;
static StringInfo tblspc_map_file;
-static bool exclusive_backup_running = false;
-static bool nonexclusive_backup_running = false;
/*
* Called when the backend exits with a running non-exclusive base backup,
@@ -78,10 +76,11 @@ pg_start_backup(PG_FUNCTION_ARGS)
char *backupidstr;
XLogRecPtr startpoint;
DIR *dir;
+ SessionBackupState status = get_backup_status();
backupidstr = text_to_cstring(backupid);
- if (exclusive_backup_running || nonexclusive_backup_running)
+ if (status == SESSION_BACKUP_NON_EXCLUSIVE)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("a backup is already in progress in this session")));
@@ -96,7 +95,6 @@ pg_start_backup(PG_FUNCTION_ARGS)
{
startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL,
dir, NULL, NULL, false, true);
- exclusive_backup_running = true;
}
else
{
@@ -113,7 +111,6 @@ pg_start_backup(PG_FUNCTION_ARGS)
startpoint = do_pg_start_backup(backupidstr, fast, NULL, label_file,
dir, NULL, tblspc_map_file, false, true);
- nonexclusive_backup_running = true;
before_shmem_exit(nonexclusive_base_backup_cleanup, (Datum) 0);
}
@@ -147,8 +144,9 @@ Datum
pg_stop_backup(PG_FUNCTION_ARGS)
{
XLogRecPtr stoppoint;
+ SessionBackupState status = get_backup_status();
- if (nonexclusive_backup_running)
+ if (status == SESSION_BACKUP_NON_EXCLUSIVE)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("non-exclusive backup in progress"),
@@ -156,14 +154,12 @@ pg_stop_backup(PG_FUNCTION_ARGS)
/*
* Exclusive backups were typically started in a different connection, so
- * don't try to verify that exclusive_backup_running is set in this one.
- * Actual verification that an exclusive backup is in fact running is
- * handled inside do_pg_stop_backup.
+ * don't try to verify that status of backup is set to
+ * SESSION_BACKUP_EXCLUSIVE in this one. Actual verification that an
+ * exclusive backup is in fact running is handled inside do_pg_stop_backup.
*/
stoppoint = do_pg_stop_backup(NULL, true, NULL);
- exclusive_backup_running = false;
-
PG_RETURN_LSN(stoppoint);
}
@@ -191,6 +187,7 @@ pg_stop_backup_v2(PG_FUNCTION_ARGS)
bool exclusive = PG_GETARG_BOOL(0);
XLogRecPtr stoppoint;
+ SessionBackupState status = get_backup_status();
/* check to see if caller supports us returning a tuplestore */
if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -222,7 +219,7 @@ pg_stop_backup_v2(PG_FUNCTION_ARGS)
if (exclusive)
{
- if (nonexclusive_backup_running)
+ if (status == SESSION_BACKUP_NON_EXCLUSIVE)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("non-exclusive backup in progress"),
@@ -233,14 +230,13 @@ pg_stop_backup_v2(PG_FUNCTION_ARGS)
* return NULL for both backup_label and tablespace_map.
*/
stoppoint = do_pg_stop_backup(NULL, true, NULL);
- exclusive_backup_running = false;
nulls[1] = true;
nulls[2] = true;
}
else
{
- if (!nonexclusive_backup_running)
+ if (status != SESSION_BACKUP_NON_EXCLUSIVE)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("non-exclusive backup is not in progress"),
@@ -251,7 +247,6 @@ pg_stop_backup_v2(PG_FUNCTION_ARGS)
* and tablespace map so they can be written to disk by the caller.
*/
stoppoint = do_pg_stop_backup(label_file->data, true, NULL);
- nonexclusive_backup_running = false;
cancel_before_shmem_exit(nonexclusive_base_backup_cleanup, (Datum) 0);
values[1] = CStringGetTextDatum(label_file->data);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 9f036c72d8..97a4782c0c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -287,8 +287,26 @@ extern void assign_max_wal_size(int newval, void *extra);
extern void assign_checkpoint_completion_target(double newval, void *extra);
/*
- * Starting/stopping a base backup
+ * Routines to starting stop, and get status of a base backup
*/
+
+/*
+ * Session-level status of base backups
+ *
+ * This is used in parallel of the shared memory status to control parallel
+ * execution of base backup functions for a given session, be it a backend
+ * dedicated to replication or a normal backend connected to a database. The
+ * update of the session-level status happens at the same time as the shared
+ * memory counters to keep a consistent global and local state of the backups
+ * running.
+ */
+typedef enum SessionBackupState
+{
+ SESSION_BACKUP_NONE,
+ SESSION_BACKUP_EXCLUSIVE,
+ SESSION_BACKUP_NON_EXCLUSIVE
+} SessionBackupState;
+
extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast,
TimeLineID *starttli_p, StringInfo labelfile, DIR *tblspcdir,
List **tablespaces, StringInfo tblspcmapfile, bool infotbssize,
@@ -296,6 +314,7 @@ extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast,
extern XLogRecPtr do_pg_stop_backup(char *labelfile, bool waitforarchive,
TimeLineID *stoptli_p);
extern void do_pg_abort_backup(void);
+extern SessionBackupState get_backup_status(void);
/* File path names (all relative to $PGDATA) */
#define BACKUP_LABEL_FILE "backup_label"
On 2/28/17 9:08 PM, Michael Paquier wrote:
On Tue, Feb 28, 2017 at 10:21 PM, David Steele <david@pgmasters.net> wrote:
On 2/27/17 10:05 PM, Michael Paquier wrote:
That's not necessarily true, I can see a stop backup able to finish as
well by issuing a cancel request. It seems to me that we just need to
have the shmem information updated at the same time as the
session-level switches for consistency and we're good. The
inconsistency in places when updating the session-level flags and the
shmem-level flags is what is causing harm.I'm sure this could be done but it will require quite a bit of
refactoring and I'm not sure that it's worth it. In my mind it would be
enough to document that cancelled backups should be considered invalid
whether the stop backup record was written or not. However, I'm willing
to go with the majority opinion.I would think that addressing the problem is the way to go, hitting an
assertion in this code path means that we are doing something wrong so
simply removing it does not sound like a correct answer to me.If that makes sense I'm happy to work up a patch. This is definitely an
edge case and I seriously doubt it is causing any issues in the field.<...>
David, are you willing to write a patch or should I? Changing
nonexclusive_backup_running/exclusive_backup_running to be an enum
would make the code more readable as well IMO.I would like to see if anyone else weighs in on this first, but yes I am
planning to write a patch. Agreed on the enum.This was itching me yesterday so I wrote a patch able to address this
problem...
I know that feeling.
At least it can be registered to the CF on time and give it
more visibility.
Indeed.
By the way, I have noticed a second bug in the
current logic while looking at the problem you have reported:
1) Start backup in session 1:
=# select pg_start_backup('popo');
pg_start_backup
-----------------
0/2000028
(1 row)
2) pg_stop_backup() in session 2
3) And now when trying to work on backups with session 1:
=# select pg_start_backup('popo');
ERROR: 55000: a backup is already in progress in this session
LOCATION: pg_start_backup, xlogfuncs.c:87
=# select pg_stop_backup();
ERROR: 55000: exclusive backup not in progress
LOCATION: do_pg_stop_backup, xlog.c:10642
Yeah, that doesn't look good.
So the handling around exclusive_backup_running is broken now as it is
possible to lock a session easily when handling backups. And the root
of the problem is that checks on exclusive_backup_running are not
necessary. I have fixed as well this problem as per the attached. Note
however that SESSION_BACKUP_EXCLUSIVE still makes sense in the patch
attached when doing checks with pg_stop_backup_v2() for a
non-exclusive backup run. It is also useful for debugging.
Excellent! I'll have a look, and thanks for working up a patch.
--
-David
david@pgmasters.net
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
As far as I understand, in this thread were discussed two bugs of pg_stop_backup().
Thanks to the clear descriptions above, I easily reproduced both of them.
BUG#1:
Server crashes on assertion on call of pg_stop_backup(false) after interrupted call of pg_stop_backup(false).
TRAP: FailedAssertion("!(XLogCtl->Insert.nonExclusiveBackups > 0)", File: "xlog.c", Line: 10747)
BUG#2:
Failure to start an exclusive backup with the same name, if previous exclusive backup was stopped in another session.
Both problems seem to be fixed with patch "backup-session-locks-fixes.patch".
Speaking of the patch itself, I have a question: shouldn't we also update sessionBackupState
in pg_stop_backup_callback() along with XLogCtl->Insert.exclusiveBackupState?
And couple of minor notes:
1) + * Routines to starting stop, and get status of a base backup
Probably should be: + * Routines to start, stop and get status of a base backup
And also this comment should be moved below the enum.
2) This is used in parallel of the shared memory status
s/ in parallel of/ in parallel with
The new status of this patch is: Waiting on Author
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Mar 15, 2017 at 12:27 AM, Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:
As far as I understand, in this thread were discussed two bugs of pg_stop_backup().
Thanks to the clear descriptions above, I easily reproduced both of them.BUG#1:
Server crashes on assertion on call of pg_stop_backup(false) after interrupted call of pg_stop_backup(false).
TRAP: FailedAssertion("!(XLogCtl->Insert.nonExclusiveBackups > 0)", File: "xlog.c", Line: 10747)BUG#2:
Failure to start an exclusive backup with the same name, if previous exclusive backup was stopped in another session.Both problems seem to be fixed with patch "backup-session-locks-fixes.patch".
Speaking of the patch itself, I have a question: shouldn't we also update sessionBackupState
in pg_stop_backup_callback() along with XLogCtl->Insert.exclusiveBackupState?
No, that's not necessary. sessionBackupState is not changed until the
code path where pg_stop_backup_callback() is active, and remains
unchanged until it gets deactivated.
And couple of minor notes:
1) + * Routines to starting stop, and get status of a base backup
Probably should be: + * Routines to start, stop and get status of a base backup
And also this comment should be moved below the enum.2) This is used in parallel of the shared memory status
s/ in parallel of/ in parallel with
Agreed on both points.
--
Michael
Attachments:
backup-session-locks-fixes-v2.patchapplication/octet-stream; name=backup-session-locks-fixes-v2.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 64335f909e..eaf8e32fe1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -504,6 +504,12 @@ typedef enum ExclusiveBackupState
} ExclusiveBackupState;
/*
+ * Session status of running backup, used for sanity checks in SQL-callable
+ * functions to start and stop backups.
+ */
+static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
+
+/*
* Shared state data for WAL insertion.
*/
typedef struct XLogCtlInsert
@@ -10527,13 +10533,17 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
/*
* Mark that start phase has correctly finished for an exclusive backup.
+ * Session-level locks are updated as well to reflect that state.
*/
if (exclusive)
{
WALInsertLockAcquireExclusive();
XLogCtl->Insert.exclusiveBackupState = EXCLUSIVE_BACKUP_IN_PROGRESS;
WALInsertLockRelease();
+ sessionBackupState = SESSION_BACKUP_EXCLUSIVE;
}
+ else
+ sessionBackupState = SESSION_BACKUP_NON_EXCLUSIVE;
/*
* We're done. As a convenience, return the starting WAL location.
@@ -10589,6 +10599,15 @@ pg_stop_backup_callback(int code, Datum arg)
}
/*
+ * Utility routine to fetch the session-level status of a backup running.
+ */
+SessionBackupState
+get_backup_status(void)
+{
+ return sessionBackupState;
+}
+
+/*
* do_pg_stop_backup is the workhorse of the user-visible pg_stop_backup()
* function.
*
@@ -10755,6 +10774,9 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
}
WALInsertLockRelease();
+ /* Clean up session-level lock */
+ sessionBackupState = SESSION_BACKUP_NONE;
+
/*
* Read and parse the START WAL LOCATION line (this code is pretty crude,
* but we are not expecting any variability in the file format).
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 96aa15e9cc..74927f27d3 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -42,8 +42,6 @@
*/
static StringInfo label_file;
static StringInfo tblspc_map_file;
-static bool exclusive_backup_running = false;
-static bool nonexclusive_backup_running = false;
/*
* Called when the backend exits with a running non-exclusive base backup,
@@ -78,10 +76,11 @@ pg_start_backup(PG_FUNCTION_ARGS)
char *backupidstr;
XLogRecPtr startpoint;
DIR *dir;
+ SessionBackupState status = get_backup_status();
backupidstr = text_to_cstring(backupid);
- if (exclusive_backup_running || nonexclusive_backup_running)
+ if (status == SESSION_BACKUP_NON_EXCLUSIVE)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("a backup is already in progress in this session")));
@@ -96,7 +95,6 @@ pg_start_backup(PG_FUNCTION_ARGS)
{
startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL,
dir, NULL, NULL, false, true);
- exclusive_backup_running = true;
}
else
{
@@ -113,7 +111,6 @@ pg_start_backup(PG_FUNCTION_ARGS)
startpoint = do_pg_start_backup(backupidstr, fast, NULL, label_file,
dir, NULL, tblspc_map_file, false, true);
- nonexclusive_backup_running = true;
before_shmem_exit(nonexclusive_base_backup_cleanup, (Datum) 0);
}
@@ -147,8 +144,9 @@ Datum
pg_stop_backup(PG_FUNCTION_ARGS)
{
XLogRecPtr stoppoint;
+ SessionBackupState status = get_backup_status();
- if (nonexclusive_backup_running)
+ if (status == SESSION_BACKUP_NON_EXCLUSIVE)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("non-exclusive backup in progress"),
@@ -156,14 +154,12 @@ pg_stop_backup(PG_FUNCTION_ARGS)
/*
* Exclusive backups were typically started in a different connection, so
- * don't try to verify that exclusive_backup_running is set in this one.
- * Actual verification that an exclusive backup is in fact running is
- * handled inside do_pg_stop_backup.
+ * don't try to verify that status of backup is set to
+ * SESSION_BACKUP_EXCLUSIVE in this one. Actual verification that an
+ * exclusive backup is in fact running is handled inside do_pg_stop_backup.
*/
stoppoint = do_pg_stop_backup(NULL, true, NULL);
- exclusive_backup_running = false;
-
PG_RETURN_LSN(stoppoint);
}
@@ -191,6 +187,7 @@ pg_stop_backup_v2(PG_FUNCTION_ARGS)
bool exclusive = PG_GETARG_BOOL(0);
XLogRecPtr stoppoint;
+ SessionBackupState status = get_backup_status();
/* check to see if caller supports us returning a tuplestore */
if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -222,7 +219,7 @@ pg_stop_backup_v2(PG_FUNCTION_ARGS)
if (exclusive)
{
- if (nonexclusive_backup_running)
+ if (status == SESSION_BACKUP_NON_EXCLUSIVE)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("non-exclusive backup in progress"),
@@ -233,14 +230,13 @@ pg_stop_backup_v2(PG_FUNCTION_ARGS)
* return NULL for both backup_label and tablespace_map.
*/
stoppoint = do_pg_stop_backup(NULL, true, NULL);
- exclusive_backup_running = false;
nulls[1] = true;
nulls[2] = true;
}
else
{
- if (!nonexclusive_backup_running)
+ if (status != SESSION_BACKUP_NON_EXCLUSIVE)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("non-exclusive backup is not in progress"),
@@ -251,7 +247,6 @@ pg_stop_backup_v2(PG_FUNCTION_ARGS)
* and tablespace map so they can be written to disk by the caller.
*/
stoppoint = do_pg_stop_backup(label_file->data, true, NULL);
- nonexclusive_backup_running = false;
cancel_before_shmem_exit(nonexclusive_base_backup_cleanup, (Datum) 0);
values[1] = CStringGetTextDatum(label_file->data);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 104ee7dd5e..d4abf94862 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -288,8 +288,26 @@ extern void assign_max_wal_size(int newval, void *extra);
extern void assign_checkpoint_completion_target(double newval, void *extra);
/*
- * Starting/stopping a base backup
+ * Routines to start, stop, and get status of a base backup.
*/
+
+/*
+ * Session-level status of base backups
+ *
+ * This is used in parallel with the shared memory status to control parallel
+ * execution of base backup functions for a given session, be it a backend
+ * dedicated to replication or a normal backend connected to a database. The
+ * update of the session-level status happens at the same time as the shared
+ * memory counters to keep a consistent global and local state of the backups
+ * running.
+ */
+typedef enum SessionBackupState
+{
+ SESSION_BACKUP_NONE,
+ SESSION_BACKUP_EXCLUSIVE,
+ SESSION_BACKUP_NON_EXCLUSIVE
+} SessionBackupState;
+
extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast,
TimeLineID *starttli_p, StringInfo labelfile, DIR *tblspcdir,
List **tablespaces, StringInfo tblspcmapfile, bool infotbssize,
@@ -297,6 +315,7 @@ extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast,
extern XLogRecPtr do_pg_stop_backup(char *labelfile, bool waitforarchive,
TimeLineID *stoptli_p);
extern void do_pg_abort_backup(void);
+extern SessionBackupState get_backup_status(void);
/* File path names (all relative to $PGDATA) */
#define BACKUP_LABEL_FILE "backup_label"
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed
As I see, this bugfix works as expected and the patch is small and clear,
so I marked it "Ready for committer".
Anyway, it would be great if David could also have a look at the patch.
And again, thank you for fixing this issue!
The new status of this patch is: Ready for Committer
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 3/15/17 2:28 AM, Michael Paquier wrote:
On Wed, Mar 15, 2017 at 12:27 AM, Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:As far as I understand, in this thread were discussed two bugs of pg_stop_backup().
Thanks to the clear descriptions above, I easily reproduced both of them.BUG#1:
Server crashes on assertion on call of pg_stop_backup(false) after interrupted call of pg_stop_backup(false).
TRAP: FailedAssertion("!(XLogCtl->Insert.nonExclusiveBackups > 0)", File: "xlog.c", Line: 10747)BUG#2:
Failure to start an exclusive backup with the same name, if previous exclusive backup was stopped in another session.Both problems seem to be fixed with patch "backup-session-locks-fixes.patch".
Speaking of the patch itself, I have a question: shouldn't we also update sessionBackupState
in pg_stop_backup_callback() along with XLogCtl->Insert.exclusiveBackupState?No, that's not necessary. sessionBackupState is not changed until the
code path where pg_stop_backup_callback() is active, and remains
unchanged until it gets deactivated.And couple of minor notes:
1) + * Routines to starting stop, and get status of a base backup
Probably should be: + * Routines to start, stop and get status of a base backup
And also this comment should be moved below the enum.2) This is used in parallel of the shared memory status
s/ in parallel of/ in parallel withAgreed on both points.
I have tested this patch and it behaves as expected and fixes the
original issue I reported. One nit, I think:
+ * SESSION_BACKUP_EXCLUSIVE in this one. Actual verification that an
Would be better phrased as:
+ * SESSION_BACKUP_EXCLUSIVE in this function. Actual verification that an
Thanks,
--
-David
david@pgmasters.net
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
On 2017-03-15 15:28:03 +0900, Michael Paquier wrote:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 64335f909e..eaf8e32fe1 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c --- a/src/backend/access/transam/xlogfuncs.c +++ b/src/backend/access/transam/xlogfuncs.c diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 104ee7dd5e..d4abf94862 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -288,8 +288,26 @@ extern void assign_max_wal_size(int newval, void *extra); extern void assign_checkpoint_completion_target(double newval, void *extra);
This seems like something easy enough to exercise in a tap test, could
we get one?
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Mar 16, 2017 at 12:46 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2017-03-15 15:28:03 +0900, Michael Paquier wrote:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 64335f909e..eaf8e32fe1 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c --- a/src/backend/access/transam/xlogfuncs.c +++ b/src/backend/access/transam/xlogfuncs.c diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 104ee7dd5e..d4abf94862 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -288,8 +288,26 @@ extern void assign_max_wal_size(int newval, void *extra); extern void assign_checkpoint_completion_target(double newval, void *extra);This seems like something easy enough to exercise in a tap test, could
we get one?
The first problem needs to have a cancel request sent when
pg_stop_backup() is running, the second needs to have a session held
to see an the inconsistent status of the session lock, which are two
concepts foreign to the TAP tests without being able to run the
queries asynchronously and keep the sessions alive.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi!
I believe patch looks good and it's ready to commit. As I understand, it fixes
bug introduced by
commit 7117685461af50f50c03f43e6a622284c8d54694
Date: Tue Apr 5 20:03:49 2016 +0200
Implement backup API functions for non-exclusive backups
And, suppose, it should be backpatched to 9.6?
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Mar 24, 2017 at 1:45 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
I believe patch looks good and it's ready to commit.
Thanks for the review!
As I understand, it fixes bug introduced by
commit 7117685461af50f50c03f43e6a622284c8d54694
Date: Tue Apr 5 20:03:49 2016 +0200Implement backup API functions for non-exclusive backups
Indeed.
And, suppose, it should be backpatched to 9.6?
Yes, that's where non-exclusive backups have been introduced.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thank you all, pushed
Michael Paquier wrote:
On Fri, Mar 24, 2017 at 1:45 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
I believe patch looks good and it's ready to commit.
Thanks for the review!
As I understand, it fixes bug introduced by
commit 7117685461af50f50c03f43e6a622284c8d54694
Date: Tue Apr 5 20:03:49 2016 +0200Implement backup API functions for non-exclusive backups
Indeed.
And, suppose, it should be backpatched to 9.6?
Yes, that's where non-exclusive backups have been introduced.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Mar 24, 2017 at 7:58 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
Thank you all, pushed.
Thanks.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers