Some 9.5beta2 backend processes not terminating properly?
After setting up 9.5beta2 on the Npgsql build server and running the Npgsql
test suite against it, I've noticed some weird behavior.
The tests run for a couple of minutes, opening and closing some connections. With my
pre-9.5 backends, the moment the test runner exits I can see that all
backend processes exit immediately, and pg_stat_activity has no rows
(except the querying one). With 9.5beta2, however, some backend processes
stay alive after the test runner exits, and pg_stat_activity
contains extra rows (state idle, waiting false). This situation persists
until I restart PostgreSQL.
This happens consistently on two machines, running Windows 7 and Windows
10. Both client and server are on the same machine and use TCP to
communicate. I can investigate further and try to produce a more isolated
repro but I thought I'd talk to you guys first.
Any thoughts or ideas on what might cause this? Any suggestions for
tracking this down?
Shay
Shay Rojansky <roji@roji.org> writes:
After setting up 9.5beta2 on the Npgsql build server and running the Npgsql
test suite against it, I've noticed some weird behavior.
The tests run for a couple of minutes, opening and closing some connections. With my
pre-9.5 backends, the moment the test runner exits I can see that all
backend processes exit immediately, and pg_stat_activity has no rows
(except the querying one). With 9.5beta2, however, some backend processes
stay alive after the test runner exits, and pg_stat_activity
contains extra rows (state idle, waiting false). This situation persists
until I restart PostgreSQL.
No idea what's happening, but a couple of questions:
* Are you using SSL connections?
* Can you get stack traces from the seemingly-stuck backends?
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
The tests run for a couple of minutes, opening and closing some connections. With my
pre-9.5 backends, the moment the test runner exits I can see that all
backend processes exit immediately, and pg_stat_activity has no rows
(except the querying one). With 9.5beta2, however, some backend processes
stay alive after the test runner exits, and pg_stat_activity
contains extra rows (state idle, waiting false). This situation persists
until I restart PostgreSQL.

No idea what's happening, but a couple of questions:
* Are you using SSL connections?
* Can you get stack traces from the seemingly-stuck backends?
Most of my tests don't use SSL but some do. Looking at the query field in
pg_stat_activity I can see queries that don't seem to originate from SSL
tests.
Note that the number of backends that stay stuck after the tests is
constant (always 12).
Here are stack dumps of the same process taken with both VS2015 Community
and Process Explorer; I went over 4 processes and saw the same thing. Let
me know what else I can provide to help.
From VS2015 Community:
Main Thread
ntdll.dll!NtWaitForMultipleObjects() Unknown
KernelBase.dll!WaitForMultipleObjectsEx() Unknown
KernelBase.dll!WaitForMultipleObjects() Unknown
postgres.exe!WaitLatchOrSocket(volatile Latch * latch, int wakeEvents,
unsigned __int64 sock, long timeout) Line 202 C
postgres.exe!secure_read(Port * port, void * ptr, unsigned __int64 len)
Line 151 C
postgres.exe!pq_getbyte() Line 926 C
postgres.exe!SocketBackend(StringInfoData * inBuf) Line 345 C
postgres.exe!PostgresMain(int argc, char * * argv, const char * dbname,
const char * username) Line 3984 C
postgres.exe!BackendRun(Port * port) Line 4236 C
postgres.exe!SubPostmasterMain(int argc, char * * argv) Line 4727 C
postgres.exe!main(int argc, char * * argv) Line 211 C
postgres.exe!__tmainCRTStartup() Line 626 C
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown
Worker Thread
ntdll.dll!NtWaitForWorkViaWorkerFactory() Unknown
ntdll.dll!TppWorkerThread() Unknown
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown
Worker Thread
ntdll.dll!NtFsControlFile() Unknown
KernelBase.dll!ConnectNamedPipe() Unknown
postgres.exe!pg_signal_thread(void * param) Line 279 C
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown
Worker Thread
ntdll.dll!NtWaitForSingleObject() Unknown
KernelBase.dll!WaitForSingleObjectEx() Unknown
postgres.exe!pg_timer_thread(void * param) Line 49 C
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown
From Process Explorer (slightly different):
ntoskrnl.exe!KeSynchronizeExecution+0x3de6
ntoskrnl.exe!KeWaitForSingleObject+0xc7a
ntoskrnl.exe!KeWaitForSingleObject+0x709
ntoskrnl.exe!KeWaitForSingleObject+0x375
ntoskrnl.exe!IoQueueWorkItem+0x370
ntoskrnl.exe!KeRemoveQueueEx+0x16ba
ntoskrnl.exe!KeWaitForSingleObject+0xe8e
ntoskrnl.exe!KeWaitForSingleObject+0x709
ntoskrnl.exe!KeWaitForMultipleObjects+0x24e
ntoskrnl.exe!ObWaitForMultipleObjects+0x2bd
ntoskrnl.exe!IoWMIRegistrationControl+0x2402
ntoskrnl.exe!setjmpex+0x3943
ntdll.dll!NtWaitForMultipleObjects+0x14
KERNELBASE.dll!WaitForMultipleObjectsEx+0xef
KERNELBASE.dll!WaitForMultipleObjects+0xe
postgres.exe!WaitLatchOrSocket+0x243
postgres.exe!secure_read+0xb0
postgres.exe!pq_getbyte+0xec
postgres.exe!get_stats_option_name+0x392
postgres.exe!PostgresMain+0x537
postgres.exe!ShmemBackendArrayAllocation+0x2a6a
postgres.exe!SubPostmasterMain+0x273
postgres.exe!main+0x480
postgres.exe!pgwin32_popen+0x130b
KERNEL32.DLL!BaseThreadInitThunk+0x22
ntdll.dll!RtlUserThreadStart+0x34
On 2015-12-29 12:41:40 +0200, Shay Rojansky wrote:
The tests run for a couple of minutes, opening and closing some connections. With my
pre-9.5 backends, the moment the test runner exits I can see that all
backend processes exit immediately, and pg_stat_activity has no rows
(except the querying one). With 9.5beta2, however, some backend processes
stay alive after the test runner exits, and pg_stat_activity
contains extra rows (state idle, waiting false). This situation persists
until I restart PostgreSQL.
Could you describe the workload a bit more? Is it rather concurrent? Do
you use optimized or debug builds? How long did you wait for the
backends to die? Is this all over localhost, an external IP but still local,
or remote?
Note that the number of backends that stay stuck after the tests is
constant (always 12).
Can you increase the number of backends used in the test? And check
whether it's still 12?
Here are stack dumps of the same process taken with both VS2015 Community
and Process Explorer; I went over 4 processes and saw the same thing. Let
me know what else I can provide to help.

From VS2015 Community:
Main Thread
ntdll.dll!NtWaitForMultipleObjects() Unknown
KernelBase.dll!WaitForMultipleObjectsEx() Unknown
KernelBase.dll!WaitForMultipleObjects() Unknown
postgres.exe!WaitLatchOrSocket(volatile Latch * latch, int wakeEvents,
unsigned __int64 sock, long timeout) Line 202 C
postgres.exe!secure_read(Port * port, void * ptr, unsigned __int64 len)
Line 151 C
postgres.exe!pq_getbyte() Line 926 C
postgres.exe!SocketBackend(StringInfoData * inBuf) Line 345 C
postgres.exe!PostgresMain(int argc, char * * argv, const char * dbname,
const char * username) Line 3984 C
postgres.exe!BackendRun(Port * port) Line 4236 C
postgres.exe!SubPostmasterMain(int argc, char * * argv) Line 4727 C
postgres.exe!main(int argc, char * * argv) Line 211 C
postgres.exe!__tmainCRTStartup() Line 626 C
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown
Hm. So we're waiting for the latch, and expecting to get an FD_CLOSE
event back because the socket is actually closed. Which should
always happen in that path - a read through win32_latch.c doesn't show any
obvious problems. But then I really don't have much of a clue about Windows
development.
How are your clients disconnecting? Possibly without properly
disconnecting?
Regards,
Andres
Could you describe the workload a bit more? Is it rather concurrent? Do
you use optimized or debug builds? How long did you wait for the
backends to die? Is this all over localhost, an external IP but still local,
or remote?
The workload is a rather diverse set of integration tests executed with
Npgsql. There's no concurrency whatsoever - tests are executed serially.
The backends stay alive indefinitely, until they are killed. All this is
over localhost with TCP. I can try other scenarios if that'll help.
Note that the number of backends that stay stuck after the tests is
constant (always 12).

Can you increase the number of backends used in the test? And check
whether it's still 12?
Well, I ran the testsuite twice in parallel, and got... 23 backends stuck
at the end.
How are your clients disconnecting? Possibly without properly
disconnecting?
That's possible, definitely in some of the test cases.
What I can do is try to isolate things further by playing around with the
tests and trying to see if a more minimal repro can be done - I'll try
doing this later today or tomorrow. If anyone has any other specific tests
or checks I should do let me know.
On Tue, Dec 29, 2015 at 7:04 PM, Shay Rojansky <roji@roji.org> wrote:
Could you describe the workload a bit more? Is it rather concurrent? Do
you use optimized or debug builds? How long did you wait for the
backends to die? Is this all over localhost, an external IP but still local,
or remote?

The workload is a rather diverse set of integration tests executed with
Npgsql. There's no concurrency whatsoever - tests are executed serially.
The backends stay alive indefinitely, until they are killed. All this is
over localhost with TCP. I can try other scenarios if that'll help.
What procedure do you use to kill backends? Normally, if we kill
via Task Manager using "End Process", it is treated as a backend
crash, the server restarts, and all other backends get
disconnected.
Note that the number of backends that stay stuck after the tests is
constant (always 12).
Can you increase the number of backends used in the test? And check
whether it's still 12?

Well, I ran the testsuite twice in parallel, and got... 23 backends stuck
at the end.

How are your clients disconnecting? Possibly without properly
disconnecting?

That's possible, definitely in some of the test cases.
What I can do is try to isolate things further by playing around with the
tests and trying to see if a more minimal repro can be done - I'll try
doing this later today or tomorrow. If anyone has any other specific tests
or checks I should do let me know.
I think first we should try to isolate whether the hung backends
are due to the sessions not being disconnected properly or
whether some other factor is involved as well. You can try to kill/
disconnect sessions connected via psql in the same way as
you do for connections with Npgsql and see if you can
reproduce the same behaviour.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
OK, I finally found some time to dive into this.
The backends seem to hang when the client closes a socket without first
sending a Terminate message - some of the tests make this happen. I've
confirmed this happens with 9.5rc1 running on Windows (versions 10 and 7),
but this does not occur on Ubuntu 15.10. The client runs on Windows as well
(although I doubt that's important).
In case it helps, here's a gist
<https://gist.github.com/roji/33df4e818c5d64a607aa> with some .NET code
that uses Npgsql 3.0.4 to reproduce this.
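In plain terms, the difference between a clean and an abrupt disconnect can be sketched with a raw socket (a hypothetical Python sketch, not the gist's Npgsql code; the only wire-protocol fact it relies on is that Terminate is the byte 'X' followed by an Int32 length of 4):

```python
import socket
import struct

# Per the frontend/backend protocol, Terminate is the byte 'X'
# followed by an Int32 message length of 4 (the length counts itself).
TERMINATE = b'X' + struct.pack('!i', 4)

def close_cleanly(sock: socket.socket) -> None:
    """Graceful disconnect: send Terminate, then close.
    The backend reads 'X' and exits on its own."""
    sock.sendall(TERMINATE)
    sock.close()

def close_abruptly(sock: socket.socket) -> None:
    """The problematic pattern: close without Terminate.
    The backend only sees a TCP FIN and must detect EOF itself -
    which is what appears to be failing on Windows here."""
    sock.close()
```

The gist triggers the second pattern from within the Npgsql tests; the sketch above just names the two behaviors being compared.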
If there's anything else I can do please let me know.
Shay
On Wed, Dec 30, 2015 at 5:32 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Tue, Dec 29, 2015 at 7:04 PM, Shay Rojansky <roji@roji.org> wrote:

Could you describe the workload a bit more? Is it rather concurrent? Do
you use optimized or debug builds? How long did you wait for the
backends to die? Is this all over localhost, an external IP but still local,
or remote?

The workload is a rather diverse set of integration tests executed with
Npgsql. There's no concurrency whatsoever - tests are executed serially.
The backends stay alive indefinitely, until they are killed. All this is
over localhost with TCP. I can try other scenarios if that'll help.

What procedure do you use to kill backends? Normally, if we kill
via Task Manager using "End Process", it is treated as a backend
crash, the server restarts, and all other backends get
disconnected.

Note that the number of backends that stay stuck after the tests is
constant (always 12).

Can you increase the number of backends used in the test? And check
whether it's still 12?

Well, I ran the testsuite twice in parallel, and got... 23 backends stuck
at the end.

How are your clients disconnecting? Possibly without properly
disconnecting?

That's possible, definitely in some of the test cases.

What I can do is try to isolate things further by playing around with the
tests and trying to see if a more minimal repro can be done - I'll try
doing this later today or tomorrow. If anyone has any other specific tests
or checks I should do let me know.

I think first we should try to isolate whether the hung backends
are due to the sessions not being disconnected properly or
whether some other factor is involved as well. You can try to kill/
disconnect sessions connected via psql in the same way as
you do for connections with Npgsql and see if you can
reproduce the same behaviour.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,
On 2015-12-30 19:01:10 +0200, Shay Rojansky wrote:
OK, I finally found some time to dive into this.
The backends seem to hang when the client closes a socket without first
sending a Terminate message - some of the tests make this happen. I've
confirmed this happens with 9.5rc1 running on Windows (versions 10 and 7),
but this does not occur on Ubuntu 15.10. The client runs on Windows as well
(although I doubt that's important).
Hm. So that seems to indicate that, on Windows, we're not properly
recognizing dead sockets in the latch code. Could you check, IIRC with
netstat or something like it, what state the connections are in?
Any chance you could single-step through WaitLatchOrSocket() with a
debugger? Without additional information this is rather hard to
diagnose.
On Wed, Dec 30, 2015 at 5:32 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

What procedure do you use to kill backends? Normally, if we kill
via Task Manager using "End Process", it is treated as a backend
crash, the server restarts, and all other backends get
disconnected.
Unless I miss something major here the problem is clients disconnecting
and leaving backends hanging. The killing of backends only comes into
play after that's already the case.
Regards,
Andres
Shay Rojansky <roji@roji.org> writes:
The backends seem to hang when the client closes a socket without first
sending a Terminate message - some of the tests make this happen. I've
confirmed this happens with 9.5rc1 running on Windows (versions 10 and 7),
but this does not occur on Ubuntu 15.10.
Nor OS X. Ugh. My first thought was that ac1d7945f broke this, but
that's only in HEAD not 9.5, so some earlier change must be responsible.
regards, tom lane
Andres Freund <andres@anarazel.de> writes:
On 2015-12-30 19:01:10 +0200, Shay Rojansky wrote:
The backends seem to hang when the client closes a socket without first
sending a Terminate message - some of the tests make this happen. I've
confirmed this happens with 9.5rc1 running on Windows (versions 10 and 7),
but this does not occur on Ubuntu 15.10. The client runs on Windows as well
(although I doubt that's important).
Hm. So that seems to indicate that, on Windows, we're not properly
recognizing dead sockets in the latch code.
Or we just broke EOF detection on Windows sockets in general. It might be
worth checking if the problem appears on the client side; that is, given a
psql running on Windows, do local-equivalent-of-kill-9 on the connected
backend, and see if psql notices. (Hm, although if it's idle psql wouldn't
notice until you next try a command, so it might be hard to tell. Maybe
kill -9 while the backend is in process of a long query?)
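Tom's client-side check boils down to ordinary EOF detection. As a hedged illustration (plain Python sockets, not psql's or PostgreSQL's code): when the peer goes away, the survivor's socket becomes readable and a read returns zero bytes - the condition that Winsock reports as FD_CLOSE:

```python
import select
import socket

def peer_closed(sock: socket.socket, timeout: float = 1.0) -> bool:
    """Return True if the peer has closed the connection.
    A socket with a dead peer selects as readable, and a peek
    then yields zero bytes (EOF)."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return False  # no event within the timeout
    return sock.recv(1, socket.MSG_PEEK) == b''

# Simulate the abrupt disconnect locally:
server_side, client_side = socket.socketpair()
client_side.close()              # client goes away without a Terminate
print(peer_closed(server_side))  # the server side can see the EOF
```

If this check misbehaves only on the Windows client, that would point at EOF detection generally rather than the latch code specifically.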
regards, tom lane
On 2015-12-30 12:30:43 -0500, Tom Lane wrote:
Nor OS X. Ugh. My first thought was that ac1d7945f broke this, but
that's only in HEAD not 9.5, so some earlier change must be responsible.
The backtrace in
http://archives.postgresql.org/message-id/CADT4RqBo79_0Vx%3D-%2By%3DnFv3zdnm_-CgGzbtSv9LhxrFEoYMVFg%40mail.gmail.com
seems to indicate that it's really WaitLatchOrSocket() not noticing the
socket is closed.
For a moment I had the theory that Port->sock might be invalid because
it somehow got closed. That'd then remove the socket from the waited-on
events, which would explain the behaviour. But afaics that's really only
possible via pq_init()'s on_proc_exit(socket_close, 0); And I can't see
how that could be reached.
FWIW, the

    if (sock == PGINVALID_SOCKET)
        wakeEvents &= ~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);

block in both latch implementations looks like a problem waiting to happen.
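To make the hazard concrete, here is a hypothetical Python rendering (not the actual latch code) of the silent masking versus the error/assert alternative raised later in this thread:

```python
INVALID_SOCKET = -1                      # stand-in for PGINVALID_SOCKET
WL_LATCH_SET = 1 << 0
WL_SOCKET_READABLE = 1 << 1
WL_SOCKET_WRITEABLE = 1 << 2

def masked_wait_events(wake_events: int, sock: int) -> int:
    """Mirrors the questioned block: if the socket is invalid, the
    socket flags are silently dropped. The caller then waits on the
    latch alone - potentially forever, with no error raised."""
    if sock == INVALID_SOCKET:
        wake_events &= ~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)
    return wake_events

def checked_wait_events(wake_events: int, sock: int) -> int:
    """The fail-loudly alternative: asking to wait on an invalid
    socket is a caller bug, so raise instead of masking it."""
    if sock == INVALID_SOCKET and wake_events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE):
        raise ValueError("socket wait events requested on an invalid socket")
    return wake_events
```

The masking variant is exactly the shape of bug that would leave a backend blocked on a latch with no socket in its wait set.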
The backends seem to hang when the client closes a socket without first
sending a Terminate message - some of the tests make this happen. I've
confirmed this happens with 9.5rc1 running on Windows (versions 10 and 7),
but this does not occur on Ubuntu 15.10. The client runs on Windows as well
(although I doubt that's important).
Hm. So that seems to indicate that, on Windows, we're not properly
recognizing dead sockets in the latch code. Could you check, IIRC with
netstat or something like it, what state the connections are in?
netstat shows the socket is in FIN_WAIT_2.
Any chance you could single-step through WaitLatchOrSocket() with a
debugger? Without additional information this is rather hard to
diagnose.
Uh, I sure can, but I have no idea what to look for :) Anything specific?
Andres Freund <andres@anarazel.de> writes:
FWIW, the

    if (sock == PGINVALID_SOCKET)
        wakeEvents &= ~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);

block in both latch implementations looks like a problem waiting to happen.

You think it should throw an error instead? Seems reasonable to me.
regards, tom lane
On 2015-12-30 19:38:23 +0200, Shay Rojansky wrote:
Hm. So that seems to indicate that, on windows, we're not properly
recognizing dead sockets in the latch code. Could you check, IIRC with
netstat or something like it, in what state the connections are?
netstat shows the socket is in FIN_WAIT_2.
Any chance you could single-step through WaitLatchOrSocket() with a
debugger? Without additional information this is rather hard to
diagnose.

Uh, I sure can, but I have no idea what to look for :) Anything
specific?
Things that'd be interesting:
1) what are the arguments passed to WaitLatchOrSocket(), most
importantly wakeEvents and sock
2) are we busy looping, or is WaitForMultipleObjects() blocking
endlessly
3) If you kill -9 (well, terminate in the task manager) a client, while
stepping serverside in WaitLatchOrSocket, does
WaitForMultipleObjects() return? If so, what paths are we taking?
Greetings,
Andres Freund
On 2015-12-30 12:41:56 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
FWIW, the

    if (sock == PGINVALID_SOCKET)
        wakeEvents &= ~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);

block in both latch implementations looks like a problem waiting to happen.

You think it should throw an error instead? Seems reasonable to me.

Yea. Error or maybe just an assert. That path seems to always indicate
something having gone wrong.
Any chance you could single-step through WaitLatchOrSocket() with a
debugger? Without additional information this is rather hard to
diagnose.

Uh, I sure can, but I have no idea what to look for :) Anything
specific?

Things that'd be interesting:
1) what are the arguments passed to WaitLatchOrSocket(), most
importantly wakeEvents and sock
2) are we busy looping, or is WaitForMultipleObjects() blocking
endlessly
3) If you kill -9 (well, terminate in the task manager) a client, while
stepping serverside in WaitLatchOrSocket, does
WaitForMultipleObjects() return? If so, what paths are we taking?
The process definitely isn't busy looping - zero CPU usage.
I'll try to set up debugging, it may take some time though (unfamiliar with
PostgreSQL internals and Windows debugging techniques).
Andres Freund <andres@anarazel.de> writes:
On 2015-12-30 12:30:43 -0500, Tom Lane wrote:
Nor OS X. Ugh. My first thought was that ac1d7945f broke this, but
that's only in HEAD not 9.5, so some earlier change must be responsible.
The backtrace in
http://archives.postgresql.org/message-id/CADT4RqBo79_0Vx%3D-%2By%3DnFv3zdnm_-CgGzbtSv9LhxrFEoYMVFg%40mail.gmail.com
seems to indicate that it's really WaitLatchOrSocket() not noticing the
socket is closed.
Right, and what I was wondering was whether adding the additional wait-for
condition had exposed some pre-existing flaw in the Windows latch code.
But that's not it, so we're left with the conclusion that we broke
something that used to work.
Are we sure this is a 9.5-only bug? Shay, can you try 9.4 branch tip
and see if it misbehaves? Can anyone else reproduce the problem?
regards, tom lane
Things that'd be interesting:
1) what are the arguments passed to WaitLatchOrSocket(), most
importantly wakeEvents and sock
wakeEvents is 8387808 and so is sock.
Tom, this bug doesn't occur with 9.4.4 (will try to download 9.4.5 and
test).
On 2015-12-30 12:50:58 -0500, Tom Lane wrote:
Right, and what I was wondering was whether adding the additional wait-for
condition had exposed some pre-existing flaw in the Windows latch code.
But that's not it, so we're left with the conclusion that we broke
something that used to work.
4bad60e is another suspect. Besides wondering why I moved the FD_CLOSE
case out of the existing if cases, I don't see anything suspicious
though. If we were hitting the write-path here, it'd be plausible that
we're hitting an issue with FD_CLOSE and waiting for writability; but
we're not.