Assertion failure when the non-exclusive pg_stop_backup aborted.

Started by Masahiko Sawadaover 8 years ago45 messageshackers
Jump to latest
#1Masahiko Sawada
sawada.mshk@gmail.com

Hi,

I got an assert failure when executing pg_terminate_backend to the
process that waiting for WAL to be archived in non-exclusive backup
mode.

TRAP: FailedAssertion("!(XLogCtl->Insert.nonExclusiveBackups > 0)",
File: "xlog.c", Line: 11123)

The one reproducing procedure is,
1. Start postgres with enabling the WAL archive
2. Execute pg_start_backup()
3. Revoke the access privileges of archive directory. (e.g., chown
root:root /path/to/archive)
4. Execute pg_stop_backup() and hangs
5. Execute pg_terminate_backend() to the process that is waiting for
WAL to be archived.
Got the assertion failure.

Perhaps we can reproduce it using pg_basebackup as well.

I think the cause of this is that do_pg_abort_backup can be called
even after we decremented XLogCtl->Insert.nonExclusiveBackups in
do_pg_stop_backup(). That is, do_pg_abort_backup can be called with
XLogCtl->Insert.nonExclusiveBackups = 0 when the transaction aborts
after processed the nonExclusiveBackups (e.g, during waiting for WAL
to be archived)
The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
backup has been introduced, so we should back-patch to the all
supported versions.

I changed do_pg_abort_backup() so that it decrements
nonExclusiveBackups only if > 0. Feedback is very welcome.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

fix_do_pg_abort_backup.patchtext/x-patch; charset=US-ASCII; name=fix_do_pg_abort_backup.patchDownload+9-2
#2Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#1)
Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

On Thu, Sep 21, 2017 at 1:07 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
backup has been introduced, so we should back-patch to the all
supported versions.

There is a typo here right? Non-exclusive backups have been introduced
in 9.6. Why would a back-patch further down be needed?

I changed do_pg_abort_backup() so that it decrements
nonExclusiveBackups only if > 0. Feedback is very welcome.

+-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
+   if (XLogCtl->Insert.nonExclusiveBackups > 0)
+       XLogCtl->Insert.nonExclusiveBackups--;
Hm, no, I don't agree. I think that instead you should just leave
do_pg_abort_backup() immediately if sessionBackupState is set to
SESSION_BACKUP_NONE. This variable is the link between the global
counters and the session stopping the backup so I don't think that we
should touch this assertion of this counter. I think that this method
would be safe as well for backup start as pg_start_backup_callback
takes care of any cleanup. Also because the counters are incremented
before entering in the PG_ENSURE_ERROR_CLEANUP block, and
sessionBackupState is updated just after leaving the block.

About the patch:
+- Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
There is a typo on this line.

Adding that to the next CF would be a good idea so as we don't forget
about it. Feel free to add me as reviewer.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#2)
Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Sep 21, 2017 at 1:07 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
backup has been introduced, so we should back-patch to the all
supported versions.

There is a typo here right? Non-exclusive backups have been introduced
in 9.6. Why would a back-patch further down be needed?

I think the non-exclusive backups infrastructure has been introduced
in 9.1 for pg_basebackup. I've not checked yet that it can be
reproduced using pg_basebackup in PG9.1 but I thought it could happen
as far as I checked the code.

I changed do_pg_abort_backup() so that it decrements
nonExclusiveBackups only if > 0. Feedback is very welcome.

+-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
+   if (XLogCtl->Insert.nonExclusiveBackups > 0)
+       XLogCtl->Insert.nonExclusiveBackups--;
Hm, no, I don't agree. I think that instead you should just leave
do_pg_abort_backup() immediately if sessionBackupState is set to
SESSION_BACKUP_NONE. This variable is the link between the global
counters and the session stopping the backup so I don't think that we
should touch this assertion of this counter. I think that this method
would be safe as well for backup start as pg_start_backup_callback
takes care of any cleanup. Also because the counters are incremented
before entering in the PG_ENSURE_ERROR_CLEANUP block, and
sessionBackupState is updated just after leaving the block.

I think that the assertion failure still can happen if the process
aborts after decremented the counter and before setting to
SESSION_BACKUP_NONE. Am I missing something?

About the patch:
+- Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
There is a typo on this line.

Adding that to the next CF would be a good idea so as we don't forget
about it. Feel free to add me as reviewer.

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#3)
Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

On Thu, Sep 21, 2017 at 4:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Sep 21, 2017 at 1:07 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
backup has been introduced, so we should back-patch to the all
supported versions.

There is a typo here right? Non-exclusive backups have been introduced
in 9.6. Why would a back-patch further down be needed?

I think the non-exclusive backups infrastructure has been introduced
in 9.1 for pg_basebackup. I've not checked yet that it can be
reproduced using pg_basebackup in PG9.1 but I thought it could happen
as far as I checked the code.

Yep, but the deficiency is caused by the use before_shmem_exit() in
the SQL-level functions present in 9.6~ which make the cleanup happen
potentially twice. If you look at the code of basebackup.c, you would
notice that the cleanups of the counters only happen within the
PG_ENSURE_ERROR_CLEANUP() blocks, so a backpatch to 9.6 should be
enough.

+-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
+   if (XLogCtl->Insert.nonExclusiveBackups > 0)
+       XLogCtl->Insert.nonExclusiveBackups--;
Hm, no, I don't agree. I think that instead you should just leave
do_pg_abort_backup() immediately if sessionBackupState is set to
SESSION_BACKUP_NONE. This variable is the link between the global
counters and the session stopping the backup so I don't think that we
should touch this assertion of this counter. I think that this method
would be safe as well for backup start as pg_start_backup_callback
takes care of any cleanup. Also because the counters are incremented
before entering in the PG_ENSURE_ERROR_CLEANUP block, and
sessionBackupState is updated just after leaving the block.

I think that the assertion failure still can happen if the process
aborts after decremented the counter and before setting to
SESSION_BACKUP_NONE. Am I missing something?

The process would stop at the next CHECK_FOR_INTERRUPTS() and trigger
the cleanup at this moment. So this happens when waiting for the
archives to be done, and the session flag is set to NONE at this
point.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#4)
Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

On Thu, Sep 21, 2017 at 5:56 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Sep 21, 2017 at 4:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

+-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
+   if (XLogCtl->Insert.nonExclusiveBackups > 0)
+       XLogCtl->Insert.nonExclusiveBackups--;
Hm, no, I don't agree. I think that instead you should just leave
do_pg_abort_backup() immediately if sessionBackupState is set to
SESSION_BACKUP_NONE. This variable is the link between the global
counters and the session stopping the backup so I don't think that we
should touch this assertion of this counter. I think that this method
would be safe as well for backup start as pg_start_backup_callback
takes care of any cleanup. Also because the counters are incremented
before entering in the PG_ENSURE_ERROR_CLEANUP block, and
sessionBackupState is updated just after leaving the block.

I think that the assertion failure still can happen if the process
aborts after decremented the counter and before setting to
SESSION_BACKUP_NONE. Am I missing something?

The process would stop at the next CHECK_FOR_INTERRUPTS() and trigger
the cleanup at this moment. So this happens when waiting for the
archives to be done, and the session flag is set to NONE at this
point.

And actually, with two non-exclusive backups taken in parallel, it is
still possible to fail on another assertion in do_pg_stop_backup()
even with your patch. Imagine the following:
1) Two backups are taken, counter for non-exclusive backups is set at 2.
2) One backup is stopped, then interrupted. This causes the counter to
be decremented twice, once in do_pg_stop_backup, and once when
aborting. Counter is at 0, switching as well forcePageWrites to
false..
3) The second backup stops, a new assertion failure is triggered.
Without assertions the counter would get at -1.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Craig Ringer
craig@2ndquadrant.com
In reply to: Michael Paquier (#4)
Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

On 21 September 2017 at 16:56, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Thu, Sep 21, 2017 at 4:40 PM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Sep 21, 2017 at 1:07 AM, Masahiko Sawada <sawada.mshk@gmail.com>

wrote:

The bug can happen in PostgreSQL 9.1 or higher that non-exclusive
backup has been introduced, so we should back-patch to the all
supported versions.

There is a typo here right? Non-exclusive backups have been introduced
in 9.6. Why would a back-patch further down be needed?

I think the non-exclusive backups infrastructure has been introduced
in 9.1 for pg_basebackup. I've not checked yet that it can be
reproduced using pg_basebackup in PG9.1 but I thought it could happen
as far as I checked the code.

Yep, but the deficiency is caused by the use before_shmem_exit() in
the SQL-level functions present in 9.6~ which make the cleanup happen
potentially twice. If you look at the code of basebackup.c, you would
notice that the cleanups of the counters only happen within the
PG_ENSURE_ERROR_CLEANUP() blocks, so a backpatch to 9.6 should be
enough.

+-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
+   if (XLogCtl->Insert.nonExclusiveBackups > 0)
+       XLogCtl->Insert.nonExclusiveBackups--;
Hm, no, I don't agree. I think that instead you should just leave
do_pg_abort_backup() immediately if sessionBackupState is set to
SESSION_BACKUP_NONE. This variable is the link between the global
counters and the session stopping the backup so I don't think that we
should touch this assertion of this counter. I think that this method
would be safe as well for backup start as pg_start_backup_callback
takes care of any cleanup. Also because the counters are incremented
before entering in the PG_ENSURE_ERROR_CLEANUP block, and
sessionBackupState is updated just after leaving the block.

I think that the assertion failure still can happen if the process
aborts after decremented the counter and before setting to
SESSION_BACKUP_NONE. Am I missing something?

The process would stop at the next CHECK_FOR_INTERRUPTS() and trigger
the cleanup at this moment. So this happens when waiting for the
archives to be done, and the session flag is set to NONE at this
point.

Another one to watch out for is that elog(...) and ereport(...) invoke
CHECK_FOR_INTERRUPTS. That's given me exciting surprises before when
combined with assertion checking and various exit cleanup hooks.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7Michael Paquier
michael@paquier.xyz
In reply to: Craig Ringer (#6)
Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

On Fri, Sep 22, 2017 at 10:41 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

Another one to watch out for is that elog(...) and ereport(...) invoke
CHECK_FOR_INTERRUPTS. That's given me exciting surprises before when
combined with assertion checking and various exit cleanup hooks.

Ahah. Good point. In this case LWLockWaitForVar() is one thing to
worry about if LWLock tracing is enabled because it can log things
before holding the existing interrupts. This can be avoided easily in
the case of this issue by updating sessionBackupState before calling
WALInsertLockRelease and while holding the WAL insert lock. Surely
this meritates a comment in the patch we want here.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#5)
Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

On Thu, Sep 21, 2017 at 5:56 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Yep, but the deficiency is caused by the use before_shmem_exit() in
the SQL-level functions present in 9.6~ which make the cleanup happen
potentially twice. If you look at the code of basebackup.c, you would
notice that the cleanups of the counters only happen within the
PG_ENSURE_ERROR_CLEANUP() blocks, so a backpatch to 9.6 should be
enough.

Thank you for explaining. I understood and agreed..

On Fri, Sep 22, 2017 at 8:02 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Sep 21, 2017 at 5:56 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Sep 21, 2017 at 4:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Sep 21, 2017 at 2:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

+-  Assert(XLogCtl->Insert.nonExclusiveBackups >= 0);
+   if (XLogCtl->Insert.nonExclusiveBackups > 0)
+       XLogCtl->Insert.nonExclusiveBackups--;
Hm, no, I don't agree. I think that instead you should just leave
do_pg_abort_backup() immediately if sessionBackupState is set to
SESSION_BACKUP_NONE. This variable is the link between the global
counters and the session stopping the backup so I don't think that we
should touch this assertion of this counter. I think that this method
would be safe as well for backup start as pg_start_backup_callback
takes care of any cleanup. Also because the counters are incremented
before entering in the PG_ENSURE_ERROR_CLEANUP block, and
sessionBackupState is updated just after leaving the block.

I think that the assertion failure still can happen if the process
aborts after decremented the counter and before setting to
SESSION_BACKUP_NONE. Am I missing something?

The process would stop at the next CHECK_FOR_INTERRUPTS() and trigger
the cleanup at this moment. So this happens when waiting for the
archives to be done, and the session flag is set to NONE at this
point.

And actually, with two non-exclusive backups taken in parallel, it is
still possible to fail on another assertion in do_pg_stop_backup()
even with your patch. Imagine the following:
1) Two backups are taken, counter for non-exclusive backups is set at 2.
2) One backup is stopped, then interrupted. This causes the counter to
be decremented twice, once in do_pg_stop_backup, and once when
aborting. Counter is at 0, switching as well forcePageWrites to
false..
3) The second backup stops, a new assertion failure is triggered.
Without assertions the counter would get at -1.

You're right. I updated the patch so that it exits do_pg_abort_backup
if the state is NONE and setting the state to NONE in
do_pg_stop_backup before releasing the WAL insert lock.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

fix_do_pg_abort_backup_v2.patchapplication/octet-stream; name=fix_do_pg_abort_backup_v2.patchDownload+11-2
#9Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#8)
Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

On Fri, Sep 22, 2017 at 11:34 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

You're right. I updated the patch so that it exits do_pg_abort_backup
if the state is NONE and setting the state to NONE in
do_pg_stop_backup before releasing the WAL insert lock.

-    /* Clean up session-level lock */
+    /*
+     * Clean up session-level lock. To avoid interrupting before changing
+     * the backup state by LWLockWaitForVar we change it while holding the
+     * WAL insert lock.
+     */
     sessionBackupState = SESSION_BACKUP_NONE;
You could just mention directly CHECK_FOR_INTERRUPTS here.
+    /* Quick exit if we have done the backup */
+    if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
+        return;

The patch contents look fine to me. I have also tested things in
depths but could not find any problems. I also looked again at the
backup start code, trying to see any flows between the moment the
session backup lock is changed and the moment the callback to do
backup cleanup is registered but I have not found any issues. I am
marking this as ready for committer.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#9)
Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

On Fri, Sep 22, 2017 at 3:00 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Sep 22, 2017 at 11:34 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

You're right. I updated the patch so that it exits do_pg_abort_backup
if the state is NONE and setting the state to NONE in
do_pg_stop_backup before releasing the WAL insert lock.

-    /* Clean up session-level lock */
+    /*
+     * Clean up session-level lock. To avoid interrupting before changing
+     * the backup state by LWLockWaitForVar we change it while holding the
+     * WAL insert lock.
+     */
sessionBackupState = SESSION_BACKUP_NONE;
You could just mention directly CHECK_FOR_INTERRUPTS here.

Thank you for the reviewing.
I have a question. Since WALInsertLockRelease seems not to call
LWLockWaitForVar I thought you wanted to mean LWLockReleaseClearVar
instead, is that right?

+    /* Quick exit if we have done the backup */
+    if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
+        return;

The patch contents look fine to me. I have also tested things in
depths but could not find any problems. I also looked again at the
backup start code, trying to see any flows between the moment the
session backup lock is changed and the moment the callback to do
backup cleanup is registered but I have not found any issues. I am
marking this as ready for committer.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#10)
Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

On Fri, Sep 22, 2017 at 3:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I have a question. Since WALInsertLockRelease seems not to call
LWLockWaitForVar I thought you wanted to mean LWLockReleaseClearVar
instead, is that right?

This looks like a copy-pasto from my side. Thanks for spotting it.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#11)
Re: Assertion failure when the non-exclusive pg_stop_backup aborted.

On Fri, Sep 22, 2017 at 3:46 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Sep 22, 2017 at 3:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I have a question. Since WALInsertLockRelease seems not to call
LWLockWaitForVar I thought you wanted to mean LWLockReleaseClearVar
instead, is that right?

This looks like a copy-pasto from my side. Thanks for spotting it.

Okay, attached the latest version patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

fix_do_pg_abort_backup_v3.patchapplication/octet-stream; name=fix_do_pg_abort_backup_v3.patchDownload+11-2
#13Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiko Sawada (#12)
Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

On Fri, Sep 22, 2017 at 4:42 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Sep 22, 2017 at 3:46 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Sep 22, 2017 at 3:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I have a question. Since WALInsertLockRelease seems not to call
LWLockWaitForVar I thought you wanted to mean LWLockReleaseClearVar
instead, is that right?

This looks like a copy-pasto from my side. Thanks for spotting it.

Okay, attached the latest version patch.

+ /* Quick exit if we have done the backup */
+ if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
+ return;

This quick exit seems to cause another problem. Please imagine the
case where there is no exclusive backup running, a backend starts
non-exclusive backup and is terminated before calling pg_stop_backup().
In this case, do_pg_abort_backup() should decrement
XLogCtl->Insert.nonExclusiveBackups, but, with the patch, because of
the above quick exit, do_pg_abort_backup() fails to decrement that.

Regards,

--
Fujii Masao

#14Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#1)
Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

Thank you for the reviewing!

On Nov 15, 2017 2:59 AM, "Fujii Masao" <masao.fujii@gmail.com> wrote:

On Fri, Sep 22, 2017 at 4:42 PM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

On Fri, Sep 22, 2017 at 3:46 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Sep 22, 2017 at 3:24 PM, Masahiko Sawada <sawada.mshk@gmail.com>

wrote:

I have a question. Since WALInsertLockRelease seems not to call
LWLockWaitForVar I thought you wanted to mean LWLockReleaseClearVar
instead, is that right?

This looks like a copy-pasto from my side. Thanks for spotting it.

Okay, attached the latest version patch.

+ /* Quick exit if we have done the backup */
+ if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
+ return;

This quick exit seems to cause another problem. Please imagine the
case where there is no exclusive backup running, a backend starts
non-exclusive backup and is terminated before calling pg_stop_backup().
In this case, do_pg_abort_backup() should decrement
XLogCtl->Insert.nonExclusiveBackups, but, with the patch, because of
the above quick exit, do_pg_abort_backup() fails to decrement that.

Hmm, I think that in this case because exclusiveBackupState is not
EXCLUSIVE_BACKUP_NONE, nonExclusiveBackups is surely decremented. Am I
missing something?

Regards,

--
Masahiko Sawada

#15Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#14)
Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

On Wed, Nov 15, 2017 at 9:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Nov 15, 2017 2:59 AM, "Fujii Masao" <masao.fujii@gmail.com> wrote:
+ /* Quick exit if we have done the backup */
+ if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
+ return;

This quick exit seems to cause another problem. Please imagine the
case where there is no exclusive backup running, a backend starts
non-exclusive backup and is terminated before calling pg_stop_backup().
In this case, do_pg_abort_backup() should decrement
XLogCtl->Insert.nonExclusiveBackups, but, with the patch, because of
the above quick exit, do_pg_abort_backup() fails to decrement that.

Hmm, I think that in this case because exclusiveBackupState is not
EXCLUSIVE_BACKUP_NONE, nonExclusiveBackups is surely decremented. Am I
missing something?

Nah. Fujii-san is right here as exclusiveBackupState is never updated
for non-exclusive backups. You need an extra check on
sessionBackupState to make sure that even for non-exclusive backups
the counter is correctly decremented if a non-exclusive session lock
is hold. For an exclusive backup, the session lock can be either
SESSION_BACKUP_EXCLUSIVE if an exclusive backup is stopped on the same
session as the start phase, or SESSION_BACKUP_NONE if the exclusive
backup is stopped from a different session. So you'd basically need
that:
+   /*
+    * Quick exit if we have done the exclusive backup or if session is
+    * not keeping around a started non-exclusive backup.
+    */
+   if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE &&
+       sessionBackupState != SESSION_BACKUP_NON_EXCLUSIVE)
+       return;

At the same time it would be safer at startup phase to update
sessionBackupState when holding the WAL insert lock to prevent other
CHECK_FOR_INTERRUPTS hazards. Suggestion of patch attached.
--
Michael

Attachments:

fix_do_pg_abort_backup_v4.patchapplication/octet-stream; name=fix_do_pg_abort_backup_v4.patchDownload+22-3
#16Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#15)
Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

On Wed, Nov 15, 2017 at 10:05 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Nov 15, 2017 at 9:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Nov 15, 2017 2:59 AM, "Fujii Masao" <masao.fujii@gmail.com> wrote:
+ /* Quick exit if we have done the backup */
+ if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE)
+ return;

This quick exit seems to cause another problem. Please imagine the
case where there is no exclusive backup running, a backend starts
non-exclusive backup and is terminated before calling pg_stop_backup().
In this case, do_pg_abort_backup() should decrement
XLogCtl->Insert.nonExclusiveBackups, but, with the patch, because of
the above quick exit, do_pg_abort_backup() fails to decrement that.

Hmm, I think that in this case because exclusiveBackupState is not
EXCLUSIVE_BACKUP_NONE, nonExclusiveBackups is surely decremented. Am I
missing something?

Nah. Fujii-san is right here as exclusiveBackupState is never updated
for non-exclusive backups.

Thank you, I was wrong.

You need an extra check on
sessionBackupState to make sure that even for non-exclusive backups
the counter is correctly decremented if a non-exclusive session lock
is hold. For an exclusive backup, the session lock can be either
SESSION_BACKUP_EXCLUSIVE if an exclusive backup is stopped on the same
session as the start phase, or SESSION_BACKUP_NONE if the exclusive
backup is stopped from a different session. So you'd basically need
that:
+   /*
+    * Quick exit if we have done the exclusive backup or if session is
+    * not keeping around a started non-exclusive backup.
+    */
+   if (XLogCtl->Insert.exclusiveBackupState == EXCLUSIVE_BACKUP_NONE &&
+       sessionBackupState != SESSION_BACKUP_NON_EXCLUSIVE)
+       return;

At the same time it would be safer at startup phase to update
sessionBackupState when holding the WAL insert lock to prevent other
CHECK_FOR_INTERRUPTS hazards. Suggestion of patch attached.

I think we need to check only sessionBackupState and don't need to
check XLogCtl->Insert.exclusiveBackupState in do_pg_abort_backup(). We
can quickly return if sessionBackupState !=
SESSION_BACKUP_NON_EXCLUSIVE. In your suggestion, I think we can still
get an assertion failure when pg_stop_backup(false) waiting for
archiving is terminated while concurrent an exclusive backup is in
progress.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#17Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#16)
Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

On Wed, Nov 15, 2017 at 12:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think we need to check only sessionBackupState and don't need to
check XLogCtl->Insert.exclusiveBackupState in do_pg_abort_backup(). We
can quickly return if sessionBackupState !=
SESSION_BACKUP_NON_EXCLUSIVE. In your suggestion, I think we can still
get an assertion failure when pg_stop_backup(false) waiting for
archiving is terminated while concurrent an exclusive backup is in
progress.

I have just gone through the thread once again, and noticed that it is
actually what I suggested upthread:
/messages/by-id/CAB7nPqTm5CDrR5Y7yyfKy+PVDZ6dWS_jKG1KStaN5m95gAMTFQ@mail.gmail.com
But your v2 posted here did not do that so it is incorrect from the start:
/messages/by-id/CAD21AoA+isXYL1_ZXMnk9xJhYEL5h6rxJtTovLi7fumqfmCYgg@mail.gmail.com

We both got a bit confused here. As do_pg_abort_backup() is only used
for non-exclusive backups (including those taken through the
replication protocol), going through the session lock for checks is
fine. Could you update your patch accordingly please?
--
Michael

#18Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#17)
Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

On Wed, Nov 15, 2017 at 2:38 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Nov 15, 2017 at 12:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think we need to check only sessionBackupState and don't need to
check XLogCtl->Insert.exclusiveBackupState in do_pg_abort_backup(). We
can quickly return if sessionBackupState !=
SESSION_BACKUP_NON_EXCLUSIVE. In your suggestion, I think we can still
get an assertion failure when pg_stop_backup(false) waiting for
archiving is terminated while concurrent an exclusive backup is in
progress.

I have just gone through the thread once again, and noticed that it is
actually what I suggested upthread:
/messages/by-id/CAB7nPqTm5CDrR5Y7yyfKy+PVDZ6dWS_jKG1KStaN5m95gAMTFQ@mail.gmail.com
But your v2 posted here did not do that so it is incorrect from the start:
/messages/by-id/CAD21AoA+isXYL1_ZXMnk9xJhYEL5h6rxJtTovLi7fumqfmCYgg@mail.gmail.com

Sorry, it's my fault. I didn't mean it but I forgot.

We both got a bit confused here. As do_pg_abort_backup() is only used
for non-exclusive backups (including those taken through the
replication protocol), going through the session lock for checks is
fine. Could you update your patch accordingly please?

One question is, since we need to check only the session lock I think
that the following change is not necessary. Even if calling
CHECK_FOR_INTERRUPTS after set sessionBackupState =
SESSION_BACKUP_EXCLUSIVE; we never call do_pg_abort_backup(). Is that
right?

@@ -10636,8 +10636,14 @@ do_pg_start_backup(const char *backupidstr,
bool fast, TimeLineID *starttli_p,
        {
                WALInsertLockAcquireExclusive();
                XLogCtl->Insert.exclusiveBackupState =
EXCLUSIVE_BACKUP_IN_PROGRESS;
-               WALInsertLockRelease();
+
+               /*
+                * Clean up session-level lock. To avoid calling
CHECK_FOR_INTERRUPTS
+                * by LWLockReleaseClearVar before changing the backup
state we change
+                * it while holding the WAL insert lock.
+                */
                sessionBackupState = SESSION_BACKUP_EXCLUSIVE;
+               WALInsertLockRelease();
        }
        else
                sessionBackupState = SESSION_BACKUP_NON_EXCLUSIVE;

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#19Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#18)
Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

On Wed, Nov 15, 2017 at 5:21 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Nov 15, 2017 at 2:38 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Nov 15, 2017 at 12:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think we need to check only sessionBackupState and don't need to
check XLogCtl->Insert.exclusiveBackupState in do_pg_abort_backup(). We
can quickly return if sessionBackupState !=
SESSION_BACKUP_NON_EXCLUSIVE. In your suggestion, I think we can still
get an assertion failure when pg_stop_backup(false) waiting for
archiving is terminated while concurrent an exclusive backup is in
progress.

I have just gone through the thread once again, and noticed that it is
actually what I suggested upthread:
/messages/by-id/CAB7nPqTm5CDrR5Y7yyfKy+PVDZ6dWS_jKG1KStaN5m95gAMTFQ@mail.gmail.com
But your v2 posted here did not do that so it is incorrect from the start:
/messages/by-id/CAD21AoA+isXYL1_ZXMnk9xJhYEL5h6rxJtTovLi7fumqfmCYgg@mail.gmail.com

Sorry, it's my fault. I didn't mean it but I forgot.

My review was wrong as well :)

We both got a bit confused here. As do_pg_abort_backup() is only used
for non-exclusive backups (including those taken through the
replication protocol), going through the session lock for checks is
fine. Could you update your patch accordingly please?

One question is, since we need to check only the session lock I think
that the following change is not necessary. Even if calling
CHECK_FOR_INTERRUPTS after set sessionBackupState =
SESSION_BACKUP_EXCLUSIVE; we never call do_pg_abort_backup(). Is that
right?

Yeah, this one is not strictly necessary for this bug, but it seems to
me that it would be a good idea for robustness wiht interrupts to be
consistent with the stop phase when updating the session lock.
--
Michael

#20Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#19)
Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

On Wed, Nov 15, 2017 at 5:39 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Nov 15, 2017 at 5:21 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Nov 15, 2017 at 2:38 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Nov 15, 2017 at 12:12 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think we need to check only sessionBackupState and don't need to
check XLogCtl->Insert.exclusiveBackupState in do_pg_abort_backup(). We
can quickly return if sessionBackupState !=
SESSION_BACKUP_NON_EXCLUSIVE. In your suggestion, I think we can still
get an assertion failure when pg_stop_backup(false) waiting for
archiving is terminated while concurrent an exclusive backup is in
progress.

I have just gone through the thread once again, and noticed that it is
actually what I suggested upthread:
/messages/by-id/CAB7nPqTm5CDrR5Y7yyfKy+PVDZ6dWS_jKG1KStaN5m95gAMTFQ@mail.gmail.com
But your v2 posted here did not do that so it is incorrect from the start:
/messages/by-id/CAD21AoA+isXYL1_ZXMnk9xJhYEL5h6rxJtTovLi7fumqfmCYgg@mail.gmail.com

Sorry, it's my fault. I didn't mean it but I forgot.

My review was wrong as well :)

We both got a bit confused here. As do_pg_abort_backup() is only used
for non-exclusive backups (including those taken through the
replication protocol), going through the session lock for checks is
fine. Could you update your patch accordingly please?

One question is, since we need to check only the session lock I think
that the following change is not necessary. Even if calling
CHECK_FOR_INTERRUPTS after set sessionBackupState =
SESSION_BACKUP_EXCLUSIVE; we never call do_pg_abort_backup(). Is that
right?

Yeah, this one is not strictly necessary for this bug, but it seems to
me that it would be a good idea for robustness wiht interrupts to be
consistent with the stop phase when updating the session lock.

Agreed. Attached the updated patch, please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

fix_do_pg_abort_backup_v5.patchapplication/octet-stream; name=fix_do_pg_abort_backup_v5.patchDownload+21-3
#21Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#20)
#22Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#21)
#23Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#22)
#24Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#23)
#25Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiko Sawada (#24)
#26Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#25)
#27Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#26)
#28Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#27)
#29Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#28)
#30Robert Haas
robertmhaas@gmail.com
In reply to: Masahiko Sawada (#29)
#31Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#30)
#32Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#31)
#33Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#32)
#34Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#33)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#34)
#36Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Robert Haas (#35)
#37Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#36)
#38Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#37)
#39Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#38)
#40Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#39)
#41Robert Haas
robertmhaas@gmail.com
In reply to: Masahiko Sawada (#40)
#42Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#41)
#43Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiko Sawada (#40)
#44Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#43)
#45Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#44)