[sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

Started by Andreas Seltenreichover 9 years ago27 messageshackers
Jump to latest
#1Andreas Seltenreich
seltenreich@gmx.de

Hi,

testing with sqlsmith shows that the following assertion doesn't hold:

FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

The triggering statements always contain a call to pg_start_backup with
the third argument 'true', i.e. it's trying to start an exlusive backup.

I didn't manage to put together a stand-alone testcase yet.

regards,
Andreas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Michael Paquier
michael@paquier.xyz
In reply to: Andreas Seltenreich (#1)
Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Thu, Aug 4, 2016 at 2:19 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:

testing with sqlsmith shows that the following assertion doesn't hold:

FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

The triggering statements always contain a call to pg_start_backup with
the third argument 'true', i.e. it's trying to start an exlusive backup.

I didn't manage to put together a stand-alone testcase yet.

While I have not been able to trigger this assertion directly, I have
bumped into the fact that pg_stop_backup can reset unconditionally
XLogCtl->Insert.exclusiveBackup *before* pg_start_backup finishes or
even creates the backup_label file if it is set. So the in-memory
state of the backup is like there is no backups running at all
(including exclusive and non-exclusive), but there could be a
backup_label file present. In this state, it is not possible to
trigger pg_start_backup or pg_stop_backup again except if the
backup_label file is manually removed.

In do_pg_stop_backup, both steps would be better reversed, like in the
patch attached. So what we should actually do in pg_stop_backup is
first look at if the backup_label file exists, and then we reset the
in-memory flag as the last thing that do_pg_start_backup does is
writing the backup_label file. This does not close completely the
window though. After the backup_label file is created, it could still
be possible to trigger the assertion if there is an error on the
tablespace map file.

This window exists as well on back-branches btw, this is not new to
9.6. Magnus, what do you think?
--
Michael

Attachments:

base-backup-crash.patchtext/x-diff; charset=US-ASCII; name=base-backup-crash.patchDownload+34-34
#3Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#2)
Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Thu, Aug 4, 2016 at 4:03 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

This window exists as well on back-branches btw, this is not new to
9.6. Magnus, what do you think?

Actually, a completely water-proof solution is to use an in-memory
flag proper to exclusive backups during the time pg_start_backup() is
called, at the cost of taking WALInsertLockAcquireExclusive once again
at the end of do_pg_start_backup(), but it seems to me that it is an
acceptable cost because pg_start_backup is not a hot code path. Using
a separate flag has the advantage to provide to the user a better
error message when pg_stop_backup is used while pg_start_backup is
running as well.

Andreas, with the patch attached is the assertion still triggered?

Thoughts from others are welcome as well.
--
Michael

Attachments:

base-backup-crash-v2.patchtext/x-diff; charset=US-ASCII; name=base-backup-crash-v2.patchDownload+26-2
#4Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#3)
Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Thu, Aug 4, 2016 at 4:38 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Andreas, with the patch attached is the assertion still triggered?

Thoughts from others are welcome as well.

Self review:
      * of streaming base backups currently in progress. forcePageWrites is set
      * to true when either of these is non-zero. lastBackupStart is the latest
      * checkpoint redo location used as a starting point for an online backup.
+     * exclusiveBackupStarted is true while pg_start_backup() is being called
+     * during an exclusive backup.
      */
     bool        exclusiveBackup;
     int            nonExclusiveBackups;
     XLogRecPtr    lastBackupStart;
+    bool        exclusiveBackupStarted;
It would be better to just use one variable that uses an enum to track
the following states of an exclusive backup: none, starting and
in-progress.
-- 
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#2)
Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

Hello,

While the exact cause of the situation is not known yet but we
have apparently forgot that pg_stop_backup can be called
simultaneously with pg_start_backup. It seems anyway to me that
pg_stop_backup should be called before the end of pg_start_backup
from their definition and what they do. And it should fix the
assertion failure.

On solution is exclusiveBackupStarted (I'd like to rename the
variable) on shared memory as the patch does.

Another place where we can have something with the same effect is
file system. We can create 'backup_label.tmp" at very early in
pg_start_backup and rename it to backup_label at the end of the
function. Anyway exclusive pg_stop_backup needs that. A defect of
that would be the remaining backup_label.tmp file after crash
during pg_start_backup. Renaming tmp to (none) is atomic enough
for this usage but I'm not sure if it's in a network file system.
exclusiveBackup is also used this kind of exclusion, this also is
replasable with the .tmp file.

But after some thoughts, it found to need to add a bunch of error
handling code for the file operations and it should become too
complex. So I abandoned it.

At Fri, 5 Aug 2016 12:16:13 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQeMQ8K3Vwh+T7Gq0O3i-3kiRyNPue9WRgqJQVuFAbZrQ@mail.gmail.com>

On Thu, Aug 4, 2016 at 4:38 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Andreas, with the patch attached is the assertion still triggered?

Thoughts from others are welcome as well.

Self review:
* of streaming base backups currently in progress. forcePageWrites is set
* to true when either of these is non-zero. lastBackupStart is the latest
* checkpoint redo location used as a starting point for an online backup.
+     * exclusiveBackupStarted is true while pg_start_backup() is being called
+     * during an exclusive backup.
*/
bool        exclusiveBackup;
int            nonExclusiveBackups;
XLogRecPtr    lastBackupStart;
+    bool        exclusiveBackupStarted;
It would be better to just use one variable that uses an enum to track
the following states of an exclusive backup: none, starting and
in-progress.

What I thought when I looked the first patch is that. Addition to
that I don't see the name exclusiveBackupStated is
appropriate.

One more argument is the necessity of the WALInsertLock at the
end of pg_start_backup. No other backend cannot reach there
concurrently and possible latency from cache coherency won't be a
matter for the purpose, I suppose. What do you think it is needed
for?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Andreas Seltenreich
seltenreich@gmx.de
In reply to: Michael Paquier (#3)
Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

Michael Paquier writes:

Andreas, with the patch attached is the assertion still triggered?
[2. text/x-diff; base-backup-crash-v2.patch]

I didn't observe the crashes since applying this patch. There should
have been about five by the amount of fuzzing done.

regards
Andreas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Michael Paquier
michael@paquier.xyz
In reply to: Andreas Seltenreich (#6)
Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Sat, Aug 6, 2016 at 6:35 PM, Andreas Seltenreich <seltenreich@gmx.de> wrote:

Michael Paquier writes:

Andreas, with the patch attached is the assertion still triggered?
[2. text/x-diff; base-backup-crash-v2.patch]

I didn't observe the crashes since applying this patch. There should
have been about five by the amount of fuzzing done.

I have reworked the patch, replacing those two booleans with a single
enum. That's definitely clearer. Also, I have added this patch to the
CF to not lose track of it:
https://commitfest.postgresql.org/10/731/
Horiguchi-san, I have added you as a reviewer as you provided some
input. I hope you don't mind.
--
Michael

Attachments:

base-backup-crash-v3.patchtext/x-patch; charset=US-ASCII; name=base-backup-crash-v3.patchDownload+50-13
#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#7)
Re: Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Mon, Aug 22, 2016 at 7:13 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Sat, Aug 6, 2016 at 6:35 PM, Andreas Seltenreich <seltenreich@gmx.de> wrote:

Michael Paquier writes:

Andreas, with the patch attached is the assertion still triggered?
[2. text/x-diff; base-backup-crash-v2.patch]

I didn't observe the crashes since applying this patch. There should
have been about five by the amount of fuzzing done.

I have reworked the patch, replacing those two booleans with a single
enum. That's definitely clearer. Also, I have added this patch to the
CF to not lose track of it:
https://commitfest.postgresql.org/10/731/
Horiguchi-san, I have added you as a reviewer as you provided some
input. I hope you don't mind.

Won't the similar problem exists for nonExclusiveBackups? Basically
in similar situation, the count for nonExclusiveBackups will be
decremented and if it hits pg_start_backup_callback(), the following
Assertion can fail.
pg_start_backup_callback()
{
..
Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
..
}

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Michael Paquier
michael@paquier.xyz
In reply to: Amit Kapila (#8)
Re: Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Mon, Aug 22, 2016 at 10:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Won't the similar problem exists for nonExclusiveBackups? Basically
in similar situation, the count for nonExclusiveBackups will be
decremented and if it hits pg_start_backup_callback(), the following
Assertion can fail.
pg_start_backup_callback()
{
..
Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
..
}

This cannot be hit for non-exclusive backups as pg_start_backup() and
pg_stop_backup() need to be launched from the same session: their call
will never overlap across session, and this overlap is the reason why
exclusive backups are exposed to the problem.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#7)
Re: Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Mon, Aug 22, 2016 at 10:43 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Sat, Aug 6, 2016 at 6:35 PM, Andreas Seltenreich <seltenreich@gmx.de> wrote:

Michael Paquier writes:

Andreas, with the patch attached is the assertion still triggered?
[2. text/x-diff; base-backup-crash-v2.patch]

I didn't observe the crashes since applying this patch. There should
have been about five by the amount of fuzzing done.

I have reworked the patch, replacing those two booleans with a single
enum. That's definitely clearer. Also, I have added this patch to the
CF to not lose track of it:
https://commitfest.postgresql.org/10/731/
Horiguchi-san, I have added you as a reviewer as you provided some
input. I hope you don't mind.

You seem to add another entry for this patch into CommitFest.
Either of them needs to be removed.
https://commitfest.postgresql.org/10/698/

This patch prevents pg_stop_backup from starting while pg_start_backup
is running. OTOH, we also should prevent pg_start_backup from starting
until "previous" pg_stop_backup has completed? What happens if
pg_start_backup starts while pg_stop_backup is running?
As far as I read the current code, ISTM that there is no need to do that,
but it's better to do the double check.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#10)
Re: Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Mon, Aug 29, 2016 at 11:16 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

You seem to add another entry for this patch into CommitFest.
Either of them needs to be removed.
https://commitfest.postgresql.org/10/698/

Indeed. I just removed this one.

This patch prevents pg_stop_backup from starting while pg_start_backup
is running. OTOH, we also should prevent pg_start_backup from starting
until "previous" pg_stop_backup has completed? What happens if
pg_start_backup starts while pg_stop_backup is running?
As far as I read the current code, ISTM that there is no need to do that,
but it's better to do the double check.

I don't agree about doing that. The on-disk presence of the
backup_label file is what prevents pg_start_backup to run should
pg_stop_backup have already decremented its counters but not unlinked
yet the backup_label, so this insurance is really enough IMO. One good
reason to keep pg_stop_backup as-is in its failure handling is that if
for example it fails to remove the backup_label file, it is still
possible to do again a backup by just removing manually the existing
backup_label file *without* restarting the server. If you have an
in-memory state it would not be possible to fallback to this method
and you'd need a method to clean up the in-memory state.

Now, well, with the patch as well as HEAD, it could be possible that
the backup counters are decremented, but pg_stop_backup *fails* to
remove the backup_label file which would prevent any subsequent
pg_start_backup to run, because there is still a backup_label file, as
well as any other pg_stop_backup to run, because there is no backup
running per the in-memory state. We could improve the in-memory state
by:
- Having an extra exclusive backup status to mark an exclusive backup
as stopping, say EXCLUSIVE_BACKUP_STOPPING. Then mark the exclusive
backup status as such when do_pg_stop_backup starts.
- delaying counter decrementation in pg_stop_backup to the moment when
the backup_label file has been removed.
- Taking an extra time the exclusive WAL insert lock after
backup_label is removed, and decrement the counters.
- During the time the backup_label file is removed, we need an error
callback to switch back to BACKUP_RUNNING in case of error, similarly
to do_pg_start_backup.
Which is more or less the attached patch. Now, if pg_stop_backup
fails, this has the disadvantage to force a server restart to clean up
the in-memory state, instead of just removing manually the existing
backup_file.

For those reasons I still strongly recommend using v3, and not meddle
with the error handling of pg_stop_backup. Because, well, that's
actually not necessary, and this would just hurt the error handling of
exclusive backups.
--
Michael

Attachments:

base-backup-crash-v4.patchapplication/x-patch; name=base-backup-crash-v4.patchDownload+160-68
#12Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#11)
Re: Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Tue, Aug 30, 2016 at 1:32 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Mon, Aug 29, 2016 at 11:16 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

You seem to add another entry for this patch into CommitFest.
Either of them needs to be removed.
https://commitfest.postgresql.org/10/698/

Indeed. I just removed this one.

This patch prevents pg_stop_backup from starting while pg_start_backup
is running. OTOH, we also should prevent pg_start_backup from starting
until "previous" pg_stop_backup has completed? What happens if
pg_start_backup starts while pg_stop_backup is running?
As far as I read the current code, ISTM that there is no need to do that,
but it's better to do the double check.

I don't agree about doing that.

Hmm... my previous comment was confusing. I wanted to comment "it's better
*also for you* to do the double check whether we need to prevent pg_start_backup
while pg_stop_backup is running or not". Your answer seems to be almost the same
as mine, i.e., not necessary. Right?

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#12)
Re: Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Tue, Aug 30, 2016 at 1:44 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Aug 30, 2016 at 1:32 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Mon, Aug 29, 2016 at 11:16 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

You seem to add another entry for this patch into CommitFest.
Either of them needs to be removed.
https://commitfest.postgresql.org/10/698/

Indeed. I just removed this one.

This patch prevents pg_stop_backup from starting while pg_start_backup
is running. OTOH, we also should prevent pg_start_backup from starting
until "previous" pg_stop_backup has completed? What happens if
pg_start_backup starts while pg_stop_backup is running?
As far as I read the current code, ISTM that there is no need to do that,
but it's better to do the double check.

I don't agree about doing that.

Hmm... my previous comment was confusing. I wanted to comment "it's better
*also for you* to do the double check whether we need to prevent pg_start_backup
while pg_stop_backup is running or not". Your answer seems to be almost the same
as mine, i.e., not necessary. Right?

Yes, that's not necessary to do more. In the worst case, say
pg_start_backup starts when pg_stop_backup is running. And
pg_stop_backup has decremented the backup counters, but not yet
removed the backup_label, then pg_start_backup would just choke on the
presence of the backup_label file.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#13)
Re: Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

Hello,

(the header of this message is crafted so it might be isolate
this message from the thread)

The patch still applies on the current master with disaplacements.

On Tue, Aug 30, 2016 at 1:44 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Aug 30, 2016 at 1:32 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Mon, Aug 29, 2016 at 11:16 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

You seem to add another entry for this patch into CommitFest.
Either of them needs to be removed.
https://commitfest.postgresql.org/10/698/

Indeed. I just removed this one.

This patch prevents pg_stop_backup from starting while pg_start_backup
is running. OTOH, we also should prevent pg_start_backup from starting
until "previous" pg_stop_backup has completed? What happens if
pg_start_backup starts while pg_stop_backup is running?
As far as I read the current code, ISTM that there is no need to do that,
but it's better to do the double check.

I don't agree about doing that.

Hmm... my previous comment was confusing. I wanted to comment "it's better
*also for you* to do the double check whether we need to prevent pg_start_backup
while pg_stop_backup is running or not". Your answer seems to be almost the same
as mine, i.e., not necessary. Right?

Yes, that's not necessary to do more. In the worst case, say
pg_start_backup starts when pg_stop_backup is running. And
pg_stop_backup has decremented the backup counters, but not yet
removed the backup_label, then pg_start_backup would just choke on the
presence of the backup_label file

I don't see any problem on the state-transition of
exclusiveBackupState. For the following part

@@ -10217,7 +10255,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
     {
       /*
        * Check for existing backup label --- implies a backup is already
-       * running.  (XXX given that we checked exclusiveBackup above,
+       * running.  (XXX given that we checked exclusiveBackupState above,
        * maybe it would be OK to just unlink any such label file?)

The issue in the XXX comment should be settled by this
chance. PostgreSQL no longer (or should not) places the file by
mistake of itself. It is only possible by someone evil crafting
it, or crash-then-restarted. For the former it can be safely
removed with some notice. For the latter we should remove it
since the backup taken through the restarting is not
reliable. Addition to that, issueing pg_start_backup indicates
that the operator already have forgotten that he/she issued it
previously. So it seems to me that it can be removed with some
WARNING.

The error checking on exclusiveBackupState looks somewhat
redandunt but I don't come up with a better coding.

So, everything other than the XXX comment looks fine for me, and
this is rather straghtforward patch. So after deciding what to do
for the issue and anyone opposed to this patch, I'll make this
'Ready for commiter'.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#14)
Re: Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Fri, Nov 4, 2016 at 5:36 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

On Tue, Aug 30, 2016 at 1:44 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Aug 30, 2016 at 1:32 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Mon, Aug 29, 2016 at 11:16 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

You seem to add another entry for this patch into CommitFest.
Either of them needs to be removed.
https://commitfest.postgresql.org/10/698/

Indeed. I just removed this one.

This patch prevents pg_stop_backup from starting while pg_start_backup
is running. OTOH, we also should prevent pg_start_backup from starting
until "previous" pg_stop_backup has completed? What happens if
pg_start_backup starts while pg_stop_backup is running?
As far as I read the current code, ISTM that there is no need to do that,
but it's better to do the double check.

I don't agree about doing that.

Hmm... my previous comment was confusing. I wanted to comment "it's better
*also for you* to do the double check whether we need to prevent pg_start_backup
while pg_stop_backup is running or not". Your answer seems to be almost the same
as mine, i.e., not necessary. Right?

Yes, that's not necessary to do more. In the worst case, say
pg_start_backup starts when pg_stop_backup is running. And
pg_stop_backup has decremented the backup counters, but not yet
removed the backup_label, then pg_start_backup would just choke on the
presence of the backup_label file

I don't see any problem on the state-transition of
exclusiveBackupState. For the following part

@@ -10217,7 +10255,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
{
/*
* Check for existing backup label --- implies a back>

up is already

-       * running.  (XXX given that we checked exclusiveBackup above,
+       * running.  (XXX given that we checked exclusiveBackupState above,
* maybe it would be OK to just unlink any such label file?)

The issue in the XXX comment should be settled by this
chance. PostgreSQL no longer (or should not) places the file by
mistake of itself. It is only possible by someone evil crafting
it, or crash-then-restarted. For the former it can be safely
removed with some notice. For the latter we should remove it
since the backup taken through the restarting is not
reliable. Addition to that, issueing pg_start_backup indicates
that the operator already have forgotten that he/she issued it
previously. So it seems to me that it can be removed with some
WARNING.

The error checking on exclusiveBackupState looks somewhat
redandunt but I don't come up with a better coding.

Yes, that's on purpose to make the error messages more verbose for the user.

So, everything other than the XXX comment looks fine for me, and
this is rather straghtforward patch.

I agree that we could do something better, but that would be an
optimization, not a bug fix which is what this patch is aiming at.

So after deciding what to do
for the issue and anyone opposed to this patch, I'll make this
'Ready for commiter'.

Thanks for the input.

By the way, as long as I look at that, the patch applies cleanly on
master and 9.6 but not further down. If a committer shows up to push
this patch, I can prepare versions for back branches as needed.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#15)
Re: Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

Hello,

At Sat, 5 Nov 2016 21:18:42 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqS8FQf3T3ZLw=v-FMMbUEM_kKcVzfxybTnG68u-SfZJ7Q@mail.gmail.com>

I don't see any problem on the state-transition of
exclusiveBackupState. For the following part

@@ -10217,7 +10255,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
{
/*
* Check for existing backup label --- implies a back>

up is already

-       * running.  (XXX given that we checked exclusiveBackup above,
+       * running.  (XXX given that we checked exclusiveBackupState above,
* maybe it would be OK to just unlink any such label file?)

The issue in the XXX comment should be settled by this
chance. PostgreSQL no longer (or should not) places the file by
mistake of itself. It is only possible by someone evil crafting
it, or crash-then-restarted. For the former it can be safely
removed with some notice. For the latter we should remove it
since the backup taken through the restarting is not
reliable. Addition to that, issueing pg_start_backup indicates
that the operator already have forgotten that he/she issued it
previously. So it seems to me that it can be removed with some
WARNING.

The error checking on exclusiveBackupState looks somewhat
redandunt but I don't come up with a better coding.

Yes, that's on purpose to make the error messages more verbose for the user.

So, everything other than the XXX comment looks fine for me, and
this is rather straghtforward patch.

I agree that we could do something better, but that would be an
optimization, not a bug fix which is what this patch is aiming at.

Ok, I understand that.

So after deciding what to do
for the issue and anyone opposed to this patch, I'll make this
'Ready for commiter'.

Thanks for the input.

By the way, as long as I look at that, the patch applies cleanly on
master and 9.6 but not further down. If a committer shows up to push
this patch, I can prepare versions for back branches as needed.

Ok, the attached is the version just rebased to the current
master. It is clieanly appliable without whitespace errors on
master and 9.6 with some displacements but fails on 9.5.

I will mark this as "Ready for Committer".

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

base-backup-crash-v5.patchtext/x-patch; charset=us-asciiDownload+160-68
#17Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#16)
Re: Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Mon, Nov 7, 2016 at 6:18 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I will mark this as "Ready for Committer".

I have just noticed that Robert has switched this patch to "needs
review" by mistake (I think that there was a mistake with another
patch), so I have switched it back to "Ready for committer". I agree
with Horiguchi-san that this seems adapted, as it got some review and
some love.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#17)
Re: Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Thu, Nov 17, 2016 at 1:18 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Mon, Nov 7, 2016 at 6:18 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I will mark this as "Ready for Committer".

I have just noticed that Robert has switched this patch to "needs
review" by mistake (I think that there was a mistake with another
patch), so I have switched it back to "Ready for committer". I agree
with Horiguchi-san that this seems adapted, as it got some review and
some love.

Moved to next CF.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#16)
Re: Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

On Tue, Nov 8, 2016 at 11:18 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello,

At Sat, 5 Nov 2016 21:18:42 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqS8FQf3T3ZLw=v-FMMbUEM_kKcVzfxybTnG68u-SfZJ7Q@mail.gmail.com>

I don't see any problem on the state-transition of
exclusiveBackupState. For the following part

@@ -10217,7 +10255,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
{
/*
* Check for existing backup label --- implies a back>

up is already

-       * running.  (XXX given that we checked exclusiveBackup above,
+       * running.  (XXX given that we checked exclusiveBackupState above,
* maybe it would be OK to just unlink any such label file?)

The issue in the XXX comment should be settled by this
chance. PostgreSQL no longer (or should not) places the file by
mistake of itself. It is only possible by someone evil crafting
it, or crash-then-restarted. For the former it can be safely
removed with some notice. For the latter we should remove it
since the backup taken through the restarting is not
reliable. Addition to that, issueing pg_start_backup indicates
that the operator already have forgotten that he/she issued it
previously. So it seems to me that it can be removed with some
WARNING.

The error checking on exclusiveBackupState looks somewhat
redandunt but I don't come up with a better coding.

Yes, that's on purpose to make the error messages more verbose for the user.

So, everything other than the XXX comment looks fine for me, and
this is rather straghtforward patch.

I agree that we could do something better, but that would be an
optimization, not a bug fix which is what this patch is aiming at.

Ok, I understand that.

So after deciding what to do
for the issue and anyone opposed to this patch, I'll make this
'Ready for commiter'.

Thanks for the input.

By the way, as long as I look at that, the patch applies cleanly on
master and 9.6 but not further down. If a committer shows up to push
this patch, I can prepare versions for back branches as needed.

Ok, the attached is the version just rebased to the current
master. It is clieanly appliable without whitespace errors on
master and 9.6 with some displacements but fails on 9.5.

I will mark this as "Ready for Committer".

Thanks for updating the patch! Here are the review comments;

+ * EXCLUSIVE_BACKUP_NONE means that there is no exclusive backup actually
+ * running, to be more precise pg_start_backup() is not being executed for
+ * an exclusive backup and there is exclusive backup in progress.

"there is exclusive" should be "there is no exclusive"?
ISTM that the description following "to be more" doesn't seem to be necessary.

+ * EXCLUSIVE_BACKUP_STARTING means that pg_start_backup() is running and
+ * that is is not done yet.
+ * EXCLUSIVE_BACKUP_STOPPING means that pg_stop_backup() is running and
+ * that is is not done yet.

Typo: "is is"

Isn't it better to mention "an exclusive backup" here? What about

EXCLUSIVE_BACKUP_STARTING means that pg_start_backup() is starting an exclusive
backup.
EXCLUSIVE_BACKUP_STOPPING means that pg_stop_backup() is stopping an exclusive
backup.

I think that it's worth explaining why ExclusiveBackupState is necessary,
in the comment.

-     * exclusiveBackup is true if a backup started with pg_start_backup() is
-     * in progress, and nonExclusiveBackups is a counter indicating the number
+     * exclusiveBackupState is changed to EXCLUSIVE_BACKUP_STARTING for the
+     * duration of pg_start_backup(), then to EXCLUSIVE_BACKUP_IN_PROGRESS once
+     * it is done, and nonExclusiveBackups is a counter indicating the number

Why didn't you mention EXCLUSIVE_BACKUP_STOPPING and _NONE?
Basically it's better to explain fully how the state changes.

+                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                     errmsg("a backup is already starting")));
+        }
+        if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_STOPPING)
+        {
+            WALInsertLockRelease();
+            ereport(ERROR,
+                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                     errmsg("a backup is currently stopping")));

Isn't it better to use "an exclusive backup" explicitly rather than
"a backup" here?

You changed the code so that pg_stop_backup() removes backup_label before
it marks the exclusive-backup-state as not-in-progress. Could you tell me
why you did this change? It's better to explain it in the comment.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#19)
Re: Re: [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

Thank you for the comment.

At Tue, 13 Dec 2016 12:49:12 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHPZ8mwx=FcAxB2kaydhBjM-di7mxy6C2B3V5oWtddp_Q@mail.gmail.com>

On Tue, Nov 8, 2016 at 11:18 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

By the way, as long as I look at that, the patch applies cleanly on
master and 9.6 but not further down. If a committer shows up to push
this patch, I can prepare versions for back branches as needed.

Ok, the attached is the version just rebased to the current
master. It is clieanly appliable without whitespace errors on
master and 9.6 with some displacements but fails on 9.5.

I will mark this as "Ready for Committer".

Thanks for updating the patch! Here are the review comments;

+ * EXCLUSIVE_BACKUP_NONE means that there is no exclusive backup actually
+ * running, to be more precise pg_start_backup() is not being executed for
+ * an exclusive backup and there is exclusive backup in progress.

"there is exclusive" should be "there is no exclusive"?
ISTM that the description following "to be more" doesn't seem to be necessary.

I agree. One can know that by reading the description on
EXCLUSIVE_BACKUP_STARTING (and STOPPING, which is not mentioned here)

+ * EXCLUSIVE_BACKUP_STARTING means that pg_start_backup() is running and
+ * that is is not done yet.
+ * EXCLUSIVE_BACKUP_STOPPING means that pg_stop_backup() is running and
+ * that is is not done yet.

Typo: "is is"

Oops. Sorry for overlooking such a silly typo.

Isn't it better to mention "an exclusive backup" here? What about

EXCLUSIVE_BACKUP_STARTING means that pg_start_backup() is starting an exclusive
backup.
EXCLUSIVE_BACKUP_STOPPING means that pg_stop_backup() is stopping an exclusive
backup.

It seems fine. Done in the attached version.

I think that it's worth explaining why ExclusiveBackupState is necessary,
in the comment.

The attached version contains the following comment.

| * State of an exclusive backup.
| *
| * EXCLUSIVE_BACKUP_STARTING and EXCLUSIVE_BACKUP_STOPPING blocks
| * pg_start_backup and pg_stop_backup to prevent a race condition. Both of the
| * functions returns error on these state.

Does this make sense?

-     * exclusiveBackup is true if a backup started with pg_start_backup() is
-     * in progress, and nonExclusiveBackups is a counter indicating the number
+     * exclusiveBackupState is changed to EXCLUSIVE_BACKUP_STARTING for the
+     * duration of pg_start_backup(), then to EXCLUSIVE_BACKUP_IN_PROGRESS once
+     * it is done, and nonExclusiveBackups is a counter indicating the number

Why didn't you mention EXCLUSIVE_BACKUP_STOPPING and _NONE?
Basically it's better to explain fully how the state changes.

Hmm, it is modfied as the following in the attached.

| * exclusiveBackupState indicates state of an exlusive backup, see
| * ExclusiveBackupState for the detail.

and the comment for ExclusiveBackupState explains the state
transition. Does this work for you?

+                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                     errmsg("a backup is already starting")));
+        }
+        if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_STOPPING)
+        {
+            WALInsertLockRelease();
+            ereport(ERROR,
+                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                     errmsg("a backup is currently stopping")));

Isn't it better to use "an exclusive backup" explicitly rather than
"a backup" here?

pg_stop_backup already says like that. To unify with them, the
messages of pg_start_backup is modifed as the following.

exlusive backup is already starting
exlusive backup is currently stopping
exlusive backup is currently in progress

You changed the code so that pg_stop_backup() removes backup_label before
it marks the exclusive-backup-state as not-in-progress. Could you tell me
why you did this change? It's better to explain it in the comment.

Previously the backup_label is the traffic signal for the backup
state and that is the cause of this problem. Now the
exclusiveBackupState takes the place of it, so it naturally
should be _NONE after all steps have been finished. It seems to
me so natural so that no additional explanation is needed.

If they are done in the opposite order, the existence check of
backup_label still left in pg_start_backup may fire.

| * Check for existing backup label --- implies a backup is already
| * running. (XXX given that we checked exclusiveBackupState above,
| * maybe it would be OK to just unlink any such label file?)
| */
| if (stat(BACKUP_LABEL_FILE, &stat_buf) != 0)
| {
| if (errno != ENOENT)
| ereport(ERROR,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

base-backup-crash-v6.patchtext/x-patch; charset=us-asciiDownload+166-72
#21Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#19)
#22Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#20)
#23Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#22)
#24Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#23)
#25Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#21)
#26Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#25)
#27Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#25)