Funny hang on PostgreSQL 10 during parallel index scan on slave
Hi all;
For the last few months we have been facing a funny problem on a slave
where queries go to 100% cpu usage and never finish, causing the recovery
process to hang and the replica to fall behind, Over time we ruled out a
lot of causes and were banging our heads against this one. Today we got a
break in it when we attached a debugger to various processes even without
debugging symbols. Not only did we get useful stack traces from the hung
query but attaching a debugger to the startup process caused the query to
finish. This has shown up in 10.2 and 10.5.
Based on the stack traces we have concluded the following seems to happen:
1. The query is in a parallel index scan or similar
2. A process is executing a parallel plan and allocating a significant
chunk of memory (2MB for example) in dynamic shared memory.
3. The startup process goes into a loop where it sends a sigusr1, sleeps
5m, and sends another sigusr1 etc.
4. The sigusr1 aborts the system call, which is then retried.
5. Because the system call takes more than 5ms, we end up in an endless
loop
I see one of two possible solutions here.
1. Exponential backoff in sending signals to maybe 1s max.
2. If there is any way to check for signals before retrying the system
call (which I am not 100% sure where it is yet but on my way).
Any feedback or thoughts before we look at implementing a patch?
--
Best Regards,
Chris Travers
Head of Database
Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
On Wed, Sep 5, 2018 at 8:23 AM Chris Travers <chris.travers@adjust.com> wrote:
1. The query is in a parallel index scan or similar
2. A process is executing a parallel plan and allocating a significant chunk of memory (2MB for example) in dynamic shared memory.
3. The startup process goes into a loop where it sends a sigusr1, sleeps 5m, and sends another sigusr1 etc.
4. The sigusr1 aborts the system call, which is then retried.
5. Because the system call takes more than 5ms, we end up in an endless loop
Do you mean this loop in dsm_impl_posix_resize() is getting
interrupted constantly and never completing?
/* We may get interrupted, if so just retry. */
do
{
rc = posix_fallocate(fd, 0, size);
} while (rc == EINTR);
--
Thomas Munro
http://www.enterprisedb.com
Yep, Maybe we should check for signals there.
On Wed, Sep 5, 2018 at 5:27 PM Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
On Wed, Sep 5, 2018 at 8:23 AM Chris Travers <chris.travers@adjust.com>
wrote:1. The query is in a parallel index scan or similar
2. A process is executing a parallel plan and allocating a significantchunk of memory (2MB for example) in dynamic shared memory.
3. The startup process goes into a loop where it sends a sigusr1,
sleeps 5m, and sends another sigusr1 etc.
4. The sigusr1 aborts the system call, which is then retried.
5. Because the system call takes more than 5ms, we end up in an endlessloop
Do you mean this loop in dsm_impl_posix_resize() is getting
interrupted constantly and never completing?/* We may get interrupted, if so just retry. */
do
{
rc = posix_fallocate(fd, 0, size);
} while (rc == EINTR);--
Thomas Munro
http://www.enterprisedb.com
--
Best Regards,
Chris Travers
Head of Database
Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
Will submit a patch here shortly. Thanks! Should we do for master and
10? Or 9.6 too?
On Wed, Sep 5, 2018 at 6:40 PM Chris Travers <chris.travers@adjust.com>
wrote:
Yep, Maybe we should check for signals there.
On Wed, Sep 5, 2018 at 5:27 PM Thomas Munro <thomas.munro@enterprisedb.com>
wrote:On Wed, Sep 5, 2018 at 8:23 AM Chris Travers <chris.travers@adjust.com>
wrote:1. The query is in a parallel index scan or similar
2. A process is executing a parallel plan and allocating a significantchunk of memory (2MB for example) in dynamic shared memory.
3. The startup process goes into a loop where it sends a sigusr1,
sleeps 5m, and sends another sigusr1 etc.
4. The sigusr1 aborts the system call, which is then retried.
5. Because the system call takes more than 5ms, we end up in anendless loop
Do you mean this loop in dsm_impl_posix_resize() is getting
interrupted constantly and never completing?/* We may get interrupted, if so just retry. */
do
{
rc = posix_fallocate(fd, 0, size);
} while (rc == EINTR);--
Thomas Munro
http://www.enterprisedb.com--
Best Regards,
Chris Travers
Head of DatabaseTel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
--
Best Regards,
Chris Travers
Head of Database
Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
On Wed, Sep 5, 2018 at 9:49 AM Chris Travers <chris.travers@adjust.com> wrote:
Will submit a patch here shortly. Thanks! Should we do for master and 10? Or 9.6 too?
On Wed, Sep 5, 2018 at 6:40 PM Chris Travers <chris.travers@adjust.com> wrote:
Yep, Maybe we should check for signals there.
Yeah, it seems reasonable to check for interrupts here. It would need
to be back-patched as far as 9.4.
--
Thomas Munro
http://www.enterprisedb.com
Hi,
On 2018-09-05 18:48:44 +0200, Chris Travers wrote:
Will submit a patch here shortly. Thanks! Should we do for master and
10? Or 9.6 too?
Please don't top-post on this list. This needs to be done in all
branches where the posix_fallocate call is present.
Yep, Maybe we should check for signals there.
On Wed, Sep 5, 2018 at 5:27 PM Thomas Munro <thomas.munro@enterprisedb.com>
wrote:On Wed, Sep 5, 2018 at 8:23 AM Chris Travers <chris.travers@adjust.com>
wrote:1. The query is in a parallel index scan or similar
2. A process is executing a parallel plan and allocating a significantchunk of memory (2MB for example) in dynamic shared memory.
3. The startup process goes into a loop where it sends a sigusr1,
sleeps 5m, and sends another sigusr1 etc.
4. The sigusr1 aborts the system call, which is then retried.
5. Because the system call takes more than 5ms, we end up in anendless loop
What you're presumably encountering here is a recovery conflict.
On Wed, Sep 5, 2018 at 6:40 PM Chris Travers <chris.travers@adjust.com>
wrote:Do you mean this loop in dsm_impl_posix_resize() is getting
interrupted constantly and never completing?/* We may get interrupted, if so just retry. */
do
{
rc = posix_fallocate(fd, 0, size);
} while (rc == EINTR);
Probably worthwile to check that the dsm code is properly robust if
errors are thrown from within here.
Greetings,
Andres Freund
On Wed, Sep 5, 2018 at 6:55 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-09-05 18:48:44 +0200, Chris Travers wrote:
Will submit a patch here shortly. Thanks! Should we do for master and
10? Or 9.6 too?Please don't top-post on this list. This needs to be done in all
branches where the posix_fallocate call is present.Yep, Maybe we should check for signals there.
On Wed, Sep 5, 2018 at 5:27 PM Thomas Munro <
thomas.munro@enterprisedb.com>
wrote:
On Wed, Sep 5, 2018 at 8:23 AM Chris Travers <
chris.travers@adjust.com>
wrote:
1. The query is in a parallel index scan or similar
2. A process is executing a parallel plan and allocating asignificant
chunk of memory (2MB for example) in dynamic shared memory.
3. The startup process goes into a loop where it sends a sigusr1,
sleeps 5m, and sends another sigusr1 etc.
4. The sigusr1 aborts the system call, which is then retried.
5. Because the system call takes more than 5ms, we end up in anendless loop
What you're presumably encountering here is a recovery conflict.
Agreed but the question is how to correct what is a fairly interesting race
condition.
On Wed, Sep 5, 2018 at 6:40 PM Chris Travers <chris.travers@adjust.com>
wrote:Do you mean this loop in dsm_impl_posix_resize() is getting
interrupted constantly and never completing?/* We may get interrupted, if so just retry. */
do
{
rc = posix_fallocate(fd, 0, size);
} while (rc == EINTR);Probably worthwile to check that the dsm code is properly robust if
errors are thrown from within here.
Will check that too. Thanks!
Greetings,
Andres Freund
--
Best Regards,
Chris Travers
Head of Database
Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
On Wed, Sep 5, 2018 at 10:13 AM Chris Travers <chris.travers@adjust.com> wrote:
On Wed, Sep 5, 2018 at 6:55 PM Andres Freund <andres@anarazel.de> wrote:
On Wed, Sep 5, 2018 at 6:40 PM Chris Travers <chris.travers@adjust.com>
wrote:Do you mean this loop in dsm_impl_posix_resize() is getting
interrupted constantly and never completing?/* We may get interrupted, if so just retry. */
do
{
rc = posix_fallocate(fd, 0, size);
} while (rc == EINTR);Probably worthwile to check that the dsm code is properly robust if
errors are thrown from within here.
Yeah, currently dsm_impl_posix_resize() returns and lets
dsm_impl_posix() clean up (close(), shm_unlink()) before raising
errors. We can't just let CHECK_FOR_INTERRUPTS() take a non-local
exit. Some refactoring involving PG_TRY()/PG_CATCH() may be the
simplest way forward.
--
Thomas Munro
http://www.enterprisedb.com
Ok, so here's my current patch (work in progress). I have not run tests
yet, and finding a way to add a test case is virtually impossible though I
expect we will find ways of using gdb to confirm local fix later.
After careful code review, I settled on the following approach which seemed
to be the least intrusive.
1. Change the retry logic so that it does not retry of QueryCancelPending
or ProcDiePending are set. In these cases EINTR is passed back to the
caller
2. Change the error handling logic of the caller so that EINTR is handled
by the next CHECK_FOR_INTERRUPTS() after cleanup.
This means that the file descriptor is now already closed and that we
handle this the same way we would with a ENOSPC. The reason for this is
that there are many places in the code where it is not clear what degree of
safety is present for throwing errors using ereport, and the patch needs to
be as unobtrusive as possible to facilitate back porting.
At this point the patch is for discussion. I have not even compiled it or
tested it but maybe there is something wrong with my approach so figured I
would send it out early.
The major assumptions are:
1. close() will never take longer than the interrupt interval that caused
the loop to break.
2. close() does not get interrupted in a way that will not cause cleanup
problems later.
3. We will reach the next interrupt check at a reasonable interval and safe
spot.
Any initial arguments against?
--
Best Regards,
Chris Travers
Head of Database
Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
Attachments:
racecondition.patchapplication/octet-stream; name=racecondition.patchDownload
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 77e1bab54b..738fd2a802 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -330,10 +330,17 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
shm_unlink(name);
errno = save_errno;
- ereport(elevel,
+ /* If we received a query cancel or termination signal, we will have EINTR set here
+ * It may not be safe to handle interrupts here of this sort so we just continue to the
+ * next point where it is.
+ */
+ if (errno == EINTR)
+ ereport(elevel,
(errcode_for_dynamic_shared_memory(),
errmsg("could not resize shared memory segment \"%s\" to %zu bytes: %m",
name, request_size)));
+
+ /* This will throw appropriate errors for query cancel or termination */
return false;
}
@@ -419,11 +426,17 @@ dsm_impl_posix_resize(int fd, off_t size)
#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
if (rc == 0)
{
- /* We may get interrupted, if so just retry. */
+ /* We may get interrupted, if so just retry unless certaub signals are sent
+ *
+ * In rare cases it is possible for a signal loop to perpetually pre-empt this call
+ * So where the signal would cause the query to cancel or the process to terminate
+ * we exit the loop and return the interupted error to the caller who is now responsible
+ * for that.
+ */
do
{
rc = posix_fallocate(fd, 0, size);
- } while (rc == EINTR);
+ } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
/*
* The caller expects errno to be set, but posix_fallocate() doesn't
On Thu, Sep 6, 2018 at 11:08 AM Chris Travers <chris.travers@adjust.com>
wrote:
Ok, so here's my current patch (work in progress). I have not run tests
yet, and finding a way to add a test case is virtually impossible though I
expect we will find ways of using gdb to confirm local fix later.After careful code review, I settled on the following approach which
seemed to be the least intrusive.1. Change the retry logic so that it does not retry of QueryCancelPending
or ProcDiePending are set. In these cases EINTR is passed back to the
caller
2. Change the error handling logic of the caller so that EINTR is handled
by the next CHECK_FOR_INTERRUPTS() after cleanup.This means that the file descriptor is now already closed and that we
handle this the same way we would with a ENOSPC. The reason for this is
that there are many places in the code where it is not clear what degree of
safety is present for throwing errors using ereport, and the patch needs to
be as unobtrusive as possible to facilitate back porting.At this point the patch is for discussion. I have not even compiled it or
tested it but maybe there is something wrong with my approach so figured I
would send it out early.The major assumptions are:
1. close() will never take longer than the interrupt interval that caused
the loop to break.
2. close() does not get interrupted in a way that will not cause cleanup
problems later.
3. We will reach the next interrupt check at a reasonable interval and
safe spot.Any initial arguments against?
The previous patch had two issues found on internal code review here. I am
sending a new patch.
This patch compiles. make check passes
make check-world fails but the errors are the same as found on master and
involve ecpg.
We will be doing further internal review and verification and then will run
through pg_indent before final submission.
Changes in this patch:
1. Previous patch had backwards check for EINTR in calling function. This
was fixed.
2. In cases where ERROR elevel was passed in, function would return
instead of error out on case of error.
So in this case we check if we expect to error on error and if so, check
for interrupts. If not, we go through the normal error handling/logging
path which might result in an warning that shared memory segment could not
be allocated followed by an actual meaningful error. I could put that in
an else if, if that is seen as a good idea, but figured I would raise it as
a discussion point.
--
Best Regards,
Chris Travers
Head of DatabaseTel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
--
Best Regards,
Chris Travers
Head of Database
Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
Attachments:
racecondition.patchapplication/octet-stream; name=racecondition.patchDownload
diff --git a/77e1bab54b b/738fd2a802
index 77e1bab54b..738fd2a802 100644
--- a/77e1bab54b
+++ b/738fd2a802
@@ -330,10 +330,17 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
shm_unlink(name);
errno = save_errno;
- ereport(elevel,
+ /* If we received a query cancel or termination signal, we will have EINTR set here
+ * It may not be safe to handle interrupts here of this sort so we just continue to the
+ * next point where it is.
+ */
+ if (errno == EINTR)
+ ereport(elevel,
(errcode_for_dynamic_shared_memory(),
errmsg("could not resize shared memory segment \"%s\" to %zu bytes: %m",
name, request_size)));
+
+ /* This will throw appropriate errors for query cancel or termination */
return false;
}
@@ -419,11 +426,17 @@ dsm_impl_posix_resize(int fd, off_t size)
#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
if (rc == 0)
{
- /* We may get interrupted, if so just retry. */
+ /* We may get interrupted, if so just retry unless certaub signals are sent
+ *
+ * In rare cases it is possible for a signal loop to perpetually pre-empt this call
+ * So where the signal would cause the query to cancel or the process to terminate
+ * we exit the loop and return the interupted error to the caller who is now responsible
+ * for that.
+ */
do
{
rc = posix_fallocate(fd, 0, size);
- } while (rc == EINTR);
+ } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
/*
* The caller expects errno to be set, but posix_fallocate() doesn't
On Thu, Sep 6, 2018 at 1:31 PM Chris Travers <chris.travers@adjust.com>
wrote:
On Thu, Sep 6, 2018 at 11:08 AM Chris Travers <chris.travers@adjust.com>
wrote:Ok, so here's my current patch (work in progress). I have not run tests
yet, and finding a way to add a test case is virtually impossible though I
expect we will find ways of using gdb to confirm local fix later.After careful code review, I settled on the following approach which
seemed to be the least intrusive.1. Change the retry logic so that it does not retry of
QueryCancelPending or ProcDiePending are set. In these cases EINTR is
passed back to the caller
2. Change the error handling logic of the caller so that EINTR is
handled by the next CHECK_FOR_INTERRUPTS() after cleanup.This means that the file descriptor is now already closed and that we
handle this the same way we would with a ENOSPC. The reason for this is
that there are many places in the code where it is not clear what degree of
safety is present for throwing errors using ereport, and the patch needs to
be as unobtrusive as possible to facilitate back porting.At this point the patch is for discussion. I have not even compiled it
or tested it but maybe there is something wrong with my approach so figured
I would send it out early.The major assumptions are:
1. close() will never take longer than the interrupt interval that
caused the loop to break.
2. close() does not get interrupted in a way that will not cause cleanup
problems later.
3. We will reach the next interrupt check at a reasonable interval and
safe spot.Any initial arguments against?
The previous patch had two issues found on internal code review here. I
am sending a new patch.This patch compiles. make check passes
make check-world fails but the errors are the same as found on master and
involve ecpg.We will be doing further internal review and verification and then will
run through pg_indent before final submission.Changes in this patch:
1. Previous patch had backwards check for EINTR in calling function.
This was fixed.
2. In cases where ERROR elevel was passed in, function would return
instead of error out on case of error.So in this case we check if we expect to error on error and if so, check
for interrupts. If not, we go through the normal error handling/logging
path which might result in an warning that shared memory segment could not
be allocated followed by an actual meaningful error. I could put that in
an else if, if that is seen as a good idea, but figured I would raise it as
a discussion point.
Attaching correct patch....
--
Best Regards,
Chris Travers
Head of DatabaseTel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin--
Best Regards,
Chris Travers
Head of DatabaseTel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
--
Best Regards,
Chris Travers
Head of Database
Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
Attachments:
racecondition.patchapplication/octet-stream; name=racecondition.patchDownload
commit c92aedb1f7584e7f3e84599b2af4d00fbead99e8
Author: Chris Traverswq <chris.travers@gmail.com>
Date: Thu Sep 6 12:09:08 2018 +0200
Fix for race condition between startup process and parallel worker.
THe startup process can send a SIGUSR1 signal every 5ms and this can cause
the parallel worker to never complete the syscall that it continues to retry.
THe fix checks for pending interrupts and addresses program flow accordingly,
treating an incoming SIGUSR1, SIGTERM, and similar as subject to cleanup similar
to an ENOSPC.
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 77e1bab54b..3929f3388d 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -47,6 +47,7 @@
*/
#include "postgres.h"
+#include "miscadmin.h"
#include <fcntl.h>
#include <unistd.h>
@@ -330,10 +331,24 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
shm_unlink(name);
errno = save_errno;
+ /* If we received a query cancel or termination signal, we will have EINTR set here
+ * It may not be safe to handle interrupts here of this sort so we just continue to the
+ * next point where it is.
+ *
+ * However, the safety of exceptions here are determined by the elevel passed in by the function
+ * so if the elevel is ERROR we check for interrupts here if there is an EINTR errno raised by the resize
+ * operation. Otherwise we log the failure and move on to the next point where interrupts are
+ * checked. This allows for the caller to handle cleanup and move to a safe point first.
+ */
+ if (errno == EINTR && elevel == ERROR)
+ CHECK_FOR_INTERRUPTS();
+
ereport(elevel,
(errcode_for_dynamic_shared_memory(),
errmsg("could not resize shared memory segment \"%s\" to %zu bytes: %m",
name, request_size)));
+
+ /* This will throw appropriate errors for query cancel or termination */
return false;
}
@@ -419,11 +434,17 @@ dsm_impl_posix_resize(int fd, off_t size)
#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
if (rc == 0)
{
- /* We may get interrupted, if so just retry. */
+ /* We may get interrupted, if so just retry unless certain signals are sent
+ *
+ * In rare cases it is possible for a signal loop to perpetually pre-empt this call
+ * So where the signal would cause the query to cancel or the process to terminate
+ * we exit the loop and return the interrupted error to the caller who is now responsible
+ * for that.
+ */
do
{
rc = posix_fallocate(fd, 0, size);
- } while (rc == EINTR);
+ } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
/*
* The caller expects errno to be set, but posix_fallocate() doesn't
As a status note, the above patch has not been run through pg_indent and
while I have run make check-world on linux (passed) and a non-Linux system
(which failed both with and without my patch) I will be making one more
small revision before final submission unless anyone thinks I need to
change approaches.
Currently, all code paths that hit the resize call the relevant functions
with an ERROR elevel. So running CHECK_FOR_INTERRUPTS after the initial
cleanup is currently safe, but the surrounding function doesn't guarantee
safety so I don't think it is a good idea to raise exceptions there.
In above patch I test for whether the level is ERROR and check for
interrupts when that is true. However as a colleague pointed out to me, if
anyone ever calls this with FATAL or PANIC as the desired error level, that
would not be safe so I will be checking for whether the level is equal to
or greater than ERROR.
We will also go through a manual test using a debugger to try to test the
behavior in these cases and make sure the patch actually resolves our
problem.
Assuming no objections here and that the manual test works, you can expect
a formal submission in the next couple of days.
--
Best Regards,
Chris Travers
Head of DatabaseTel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
--
Best Regards,
Chris Travers
Head of Database
Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin