"could not reattach to shared memory" on buildfarm member dory
So far, dory has failed three times with essentially identical symptoms:
2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared memory (key=0000000000000190, addr=00000000018E0000): error code 487
2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel worker
2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from tenk1 group by twenty;
2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel worker" (PID 2240) exited with exit code 1
Now how can this be? We've successfully reserved and released the address
range we want to use, so it *should* be free at the instant we try to map.
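For reference, the sequence in question looks roughly like this (a
simplified sketch of pgwin32_ReserveSharedMemoryRegion() and
PGSharedMemoryReAttach(), variable names abbreviated, error handling
omitted):

    /* In the postmaster, right after CreateProcess(): reserve the
     * target range in the child so nothing grabs it early. */
    VirtualAllocEx(childProcess, UsedShmemSegAddr, UsedShmemSegSize,
                   MEM_RESERVE, PAGE_READWRITE);

    /* In the child, at reattach time: release the reservation ... */
    VirtualFree(UsedShmemSegAddr, 0, MEM_RELEASE);

    /* ... and immediately map the real segment at the same address.
     * Error code 487 (ERROR_INVALID_ADDRESS) means something took
     * part of the range in the window between these two calls. */
    hdr = MapViewOfFileEx(UsedShmemSegID, FILE_MAP_READ | FILE_MAP_WRITE,
                          0, 0, 0, UsedShmemSegAddr);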
Another thing that seems curious, though it may just be an artifact of
not having many data points yet, is that these failures all occurred
during pg_upgradecheck. You'd think the "check" and "install-check"
steps would be equally vulnerable to the failure.
I guess the good news is that we're seeing this in a reasonably
reproducible fashion, so there's some hope of digging down to find
out the actual cause.
regards, tom lane
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
So far, dory has failed three times with essentially identical symptoms:
2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared memory (key=0000000000000190, addr=00000000018E0000): error code 487
2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel worker
2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from tenk1 group by twenty;
2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel worker" (PID 2240) exited with exit code 1
Now how can this be? We've successfully reserved and released the address
range we want to use, so it *should* be free at the instant we try to map.
Yeah, that's definitely interesting.
I guess the good news is that we're seeing this in a reasonably
reproducible fashion, so there's some hope of digging down to find
out the actual cause.
I've asked Heath to take a look at the system again and see if there's
any Windows logs or such that might help us understand what's happening.
AV was disabled on the box, so don't think it's that, at least.
Thanks!
Stephen
On Tue, Apr 24, 2018 at 11:18 AM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
So far, dory has failed three times with essentially identical symptoms:
2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared memory (key=0000000000000190, addr=00000000018E0000): error code 487
2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel worker
2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from tenk1 group by twenty;
2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel worker" (PID 2240) exited with exit code 1
Now how can this be? We've successfully reserved and released the address
range we want to use, so it *should* be free at the instant we try to map.
Yeah, that's definitely interesting.
I wondered if another thread with the right timing could map something
between the VirtualFree() and MapViewOfFileEx() calls, but we don't
create the Windows signal handling thread until a bit later. Could
there be any other threads active in the process?
Maybe try asking what's mapped there with VirtualQueryEx() on failure?
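Something along these lines might do it (an untested sketch, to be run
only in the failure path):

    /* Untested: ask Windows what currently occupies the target address. */
    MEMORY_BASIC_INFORMATION mbi;

    if (VirtualQueryEx(GetCurrentProcess(), UsedShmemSegAddr,
                       &mbi, sizeof(mbi)) != 0)
        elog(LOG, "%p: base %p, size %zu, state 0x%lx, type 0x%lx",
             UsedShmemSegAddr, mbi.BaseAddress, mbi.RegionSize,
             mbi.State, mbi.Type);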
--
Thomas Munro
http://www.enterprisedb.com
On Tue, Apr 24, 2018 at 11:37:33AM +1200, Thomas Munro wrote:
On Tue, Apr 24, 2018 at 11:18 AM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
So far, dory has failed three times with essentially identical symptoms:
2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared memory (key=0000000000000190, addr=00000000018E0000): error code 487
2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel worker
2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from tenk1 group by twenty;
2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel worker" (PID 2240) exited with exit code 1
Now how can this be? We've successfully reserved and released the address
range we want to use, so it *should* be free at the instant we try to map.
Yeah, that's definitely interesting.
I wondered if another thread with the right timing could map something
between the VirtualFree() and MapViewOfFileEx() calls, but we don't
create the Windows signal handling thread until a bit later. Could
there be any other threads active in the process?
Maybe try asking what's mapped there with VirtualQueryEx() on failure?
+1. An implementation of that:
/messages/by-id/20170403065106.GA2624300@tornado.leadboat.com
On Tue, Apr 24, 2018 at 1:18 AM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
So far, dory has failed three times with essentially identical symptoms:
2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared
memory (key=0000000000000190, addr=00000000018E0000): error code 487
2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel
worker
2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from
tenk1 group by twenty;
2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel
worker" (PID 2240) exited with exit code 1
Now how can this be? We've successfully reserved and released the
address
range we want to use, so it *should* be free at the instant we try to
map.
Yeah, that's definitely interesting.
I guess the good news is that we're seeing this in a reasonably
reproducible fashion, so there's some hope of digging down to find
out the actual cause.
I've asked Heath to take a look at the system again and see if there's
any Windows logs or such that might help us understand what's happening.
AV was disabled on the box, so don't think it's that, at least.
Disabled or uninstalled?
Back when I was combating Windows AV on a daily basis, this normally did
not have the same effect. Just disabling the AV didn't actually remove
the parts that caused issues; it just hid them. An actual uninstall is
what was required.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On 24 April 2018 at 15:18, Magnus Hagander <magnus@hagander.net> wrote:
Back when I was combating Windows AV on a daily basis, this normally did not
have the same effect. Just disabling the AV didn't actually remove the parts
that caused issues; it just hid them. An actual uninstall is what was required.
Yep. Specifically, they tended to inject kernel hooks and/or load hook
DLLs that did funky and often flakey things. Often with poor awareness
of things like multiple processes opening one file for write at the
same time.
I think I heard that MS has cleaned up the situation with AV
considerably by offering more kernel infrastructure for it, and
restricting what you can do in terms of kernel extensions etc. But I
don't know how much.
In any case, do you think dropping a minidump at the point of failure
might be informative? It should contain the full memory mapping
information. For this purpose we could just create a crashdumps/
directory then abort() when we detect the error, and have the
buildfarm stop processing until someone can grab the tempdir with the
dumps, binaries, .pdb files, etc.
src/backend/port/win32/crashdump.c doesn't expose a helper function to
do all the dbghelp.dll messing around and create a crashdump; it only
allows that to be done via a crash handler. But it might make sense to
break out the actual "write a crash dump" part to a separately
callable function. I've looked at doing this before, but always got
stuck with the apparent lack of support in gdb or lldb to be used as a
library for self-dumping. You can always shell out to gcore I guess...
but ew. Or we can fork() and abort() the forked child like
https://github.com/RuntimeTools/gencore does, but again, ew.
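The core of such a helper would be small, something like this untested
sketch (the function name is invented, and real code should keep loading
dbghelp.dll dynamically the way crashdump.c already does):

    #include <windows.h>
    #include <dbghelp.h>            /* link with dbghelp.lib */

    /* Untested sketch: write a minidump of the current process. */
    static BOOL
    write_self_minidump(const char *path)
    {
        HANDLE  file = CreateFileA(path, GENERIC_WRITE, 0, NULL,
                                   CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        BOOL    ok;

        if (file == INVALID_HANDLE_VALUE)
            return FALSE;
        /* MiniDumpWithFullMemoryInfo records the region maps we care about */
        ok = MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(),
                               file, MiniDumpWithFullMemoryInfo,
                               NULL, NULL, NULL);
        CloseHandle(file);
        return ok;
    }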
I was thinking that maybe the buildfarm could just create crashdumps/
automatically, but then we'd need to have support infrastructure for
recording the Pg binaries and .pdb files along with the dumps,
rotating them so we don't run out of space, etc etc.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Greetings,
* Magnus Hagander (magnus@hagander.net) wrote:
On Tue, Apr 24, 2018 at 1:18 AM, Stephen Frost <sfrost@snowman.net> wrote:
I've asked Heath to take a look at the system again and see if there's
any Windows logs or such that might help us understand what's happening.
AV was disabled on the box, so don't think it's that, at least.
Disabled or uninstalled?
The only AV installed on the system is the "Windows Defender" thing, so
it's not some additional AV system.
Thanks!
Stephen
Noah Misch <noah@leadboat.com> writes:
On Tue, Apr 24, 2018 at 11:37:33AM +1200, Thomas Munro wrote:
Maybe try asking what's mapped there with VirtualQueryEx() on failure?
+1. An implementation of that:
/messages/by-id/20170403065106.GA2624300@tornado.leadboat.com
Not seeing any other work happening here, I pushed a little bit of
quick-hack investigation code. This is based on noting that
VirtualAllocEx is documented as rounding the allocation up to a page
boundary (4K), but there's nothing specific about whether or how much
CreateFileMapping or MapViewOfFileEx might round up. The observed
failures could be explained if those guys might eat more virtual
address space for the same request size as VirtualAllocEx does.
This is a stretch, for sure, but given the lack of any other theories
we might as well check it.
regards, tom lane
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Noah Misch <noah@leadboat.com> writes:
On Tue, Apr 24, 2018 at 11:37:33AM +1200, Thomas Munro wrote:
Maybe try asking what's mapped there with VirtualQueryEx() on failure?
+1. An implementation of that:
/messages/by-id/20170403065106.GA2624300@tornado.leadboat.com
Not seeing any other work happening here, I pushed a little bit of
quick-hack investigation code. This is based on noting that
VirtualAllocEx is documented as rounding the allocation up to a page
boundary (4K), but there's nothing specific about whether or how much
CreateFileMapping or MapViewOfFileEx might round up. The observed
failures could be explained if those guys might eat more virtual
address space for the same request size as VirtualAllocEx does.
This is a stretch, for sure, but given the lack of any other theories
we might as well check it.
Sounds good to me. Just as an FYI, there are a couple of folks taking a
look at the system and trying to figure out what's going on. We've seen
an Event ID 1530 error in the Windows Event log associated with
vctip.exe, which Visual Studio was running as part of the build, but only
sometimes. When vctip.exe runs and then finishes, it goes and cleans
things up, which seems to be what's triggering the 1530, and that appears
to correlate with the failures, but it's hard to say whether that's
really a smoking gun or just coincidence.
Thanks!
Stephen
[ Thanks to Stephen for cranking up a continuous build loop on dory ]
Noah Misch <noah@leadboat.com> writes:
On Tue, Apr 24, 2018 at 11:37:33AM +1200, Thomas Munro wrote:
Maybe try asking what's mapped there with VirtualQueryEx() on failure?
+1. An implementation of that:
/messages/by-id/20170403065106.GA2624300@tornado.leadboat.com
So I tried putting in that code, and it turns the problem from something
that maybe happens in every third buildfarm run or so, to something that
happens at least a dozen times in a single "make check" step. This seems
to mean that either EnumProcessModules or GetModuleFileNameEx is itself
allocating memory, and sometimes that allocation comes out of the space
VirtualFree just freed :-(.
So we can't use those functions. We have however proven that no new
module gets loaded during VirtualFree or MapViewOfFileEx, so there
doesn't seem to be anything more to be learned from them anyway.
What it looks like to me is that MapViewOfFileEx allocates some memory and
sometimes that comes out of the wrong place. This is, um, unfortunate.
It also appears that VirtualFree might sometimes allocate some memory,
and that'd be even more unfortunate, but it's hard to be certain; the
blame might well fall on VirtualQuery instead. (Ain't Heisenbugs fun?)
The solution I was thinking about last night was to have
PGSharedMemoryReAttach call MapViewOfFileEx to map the shared memory
segment at an unspecified address, then unmap it, then call VirtualFree,
and finally call MapViewOfFileEx with the real target address. The idea
here is to get these various DLLs to set up any memory allocation pools
they're going to set up before we risk doing VirtualFree. I am not,
at this point, convinced this will fix it :-( ... but I'm not sure what
else to try.
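In code form, the proposed ordering would be roughly this (sketch only,
error handling omitted):

    /* Map the segment at any address Windows likes, to coax the DLLs
     * into grabbing whatever pool space they're going to grab ... */
    void *probe = MapViewOfFileEx(UsedShmemSegID,
                                  FILE_MAP_READ | FILE_MAP_WRITE,
                                  0, 0, 0, NULL);

    UnmapViewOfFile(probe);

    /* ... then give up the reservation and map at the real address. */
    VirtualFree(UsedShmemSegAddr, 0, MEM_RELEASE);
    hdr = MapViewOfFileEx(UsedShmemSegID, FILE_MAP_READ | FILE_MAP_WRITE,
                          0, 0, 0, UsedShmemSegAddr);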
In any case, it's still pretty unclear why dory is showing this problem
and other buildfarm members are not. whelk for instance seems to be
loading all the same DLLs and more besides.
regards, tom lane
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
[ Thanks to Stephen for cranking up a continuous build loop on dory ]
That was actually Heath, who is also trying to work this issue and has
an idea about something else which might help (and some more information
about what's happening in the event log). Adding him to the thread so
he can (easily) reply with what he's found.
Heath..?
Thanks!
Stephen
On Mon, Apr 30, 2018 at 2:34 PM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
[ Thanks to Stephen for cranking up a continuous build loop on dory ]
That was actually Heath, who is also trying to work this issue and has
an idea about something else which might help (and some more information
about what's happening in the event log). Adding him to the thread so
he can (easily) reply with what he's found.
Heath..?
Thanks!
Stephen
So what I noticed after adding the '--force' flag was that in the Event
Viewer logs there was an Error in the System log stating that "The
application-specific permission settings do not grant Local Activation
permission for the COM Server application" for the Runtime Broker. So
around 2:00 pm today I went and changed the ownership of the registry
values to Administrators so I could add the user we are running the builds
under to the list of members that have access to it. However, right after
I made that change the build was actually broken for me, so I am just now
turning the builds back on to run constantly to verify whether this has
any effect on the issue of not being able to reattach to the shared
memory. I am hoping that this makes things more stable; however, I am not
confident that these are related.
The change that I attempted prior to this, which was done last Wednesday,
was to stop the vctip.exe program from being used, as this was causing
issues for us as well. It was causing an Event ID 1530 error stating
that "The application that is listed in the event details is leaving the
registry handle open", and Windows was helpfully closing any registry
handles for that user profile; I thought that possibly, when doing so, it
was also cleaning up memory that was allocated prior. However, that
change did not make any difference for us.
Any ideas or changes that we could try to help debug or verify would be
helpful. We have considered changing it to run everything as SYSTEM, but
if possible we would like to avoid this for security reasons. Thank you
in advance, and I appreciate all the help.
-Heath
Greetings,
* Heath Lord (heath.lord@crunchydata.com) wrote:
Any ideas or changes that we could try to help debug or verify would be
helpful. We have considered changing it to run everything as SYSTEM, but
if possible we would like to avoid this for security reasons. Thank you
in advance, and I appreciate all the help.
Just to be clear: there is no longer anything showing up in the Event
Viewer associated with running the builds. There may still be an issue
with the system setup or such, but that at least seems less likely to be
the cause, so I'm thinking that Tom is more likely correct that PG is
doing something not quite right here.
Thanks!
Stephen
Heath Lord <heath.lord@crunchydata.com> writes:
So what I noticed after adding the '--force' flag was that in the Event
Viewer logs there was an Error in the System log stating that "The
application-specific permission settings do not grant Local Activation
permission for the COM Server application" for the Runtime Broker. So
around 2:00 pm today I went and changed the ownership of the registry
values to Administrators so I could add the user we are running the builds
under to the list of members that have access to it. However, right after
I made that change the build was actually broken for me, so I am just now
turning the builds back on to run constantly to verify whether this has
any effect on the issue of not being able to reattach to the shared
memory. I am hoping that this makes things more stable; however, I am not
confident that these are related.
The build was broken on Windows between about 12:30 and 14:30 EDT,
thanks to an unrelated change. Now that that's sorted, dory is
still failing :-(.
Moreover, even though I took out the module dump logic, the failure rate
is still a good bit higher than it was before, which seems to confirm my
fear that this is a Heisenbug: either VirtualQuery itself, or the act of
elog'ing a bunch of output, is causing memory allocations to take place
that otherwise would not have.
I'm hoping that the elog output is to blame for that, and am going to
go try to rejigger the code so that we capture the memory maps into space
that was allocated before VirtualFree.
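Something like this untested sketch: walk the address space into a
static buffer during the critical window, and only elog() the results
once we're past it:

    #define MAX_REGIONS 1024
    static MEMORY_BASIC_INFORMATION regions[MAX_REGIONS];  /* preallocated */

    /* Untested: snapshot the maps without calling any allocator. */
    static int
    capture_memory_map(void)
    {
        char   *addr = NULL;
        int     n = 0;

        while (n < MAX_REGIONS &&
               VirtualQuery(addr, &regions[n], sizeof(regions[n])) != 0)
        {
            addr = (char *) regions[n].BaseAddress + regions[n].RegionSize;
            n++;
        }
        return n;           /* print with elog() later, outside the window */
    }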
Any ideas or changes that we could try to help debug or verify would be
helpful. We have considered changing it to run everything as SYSTEM, but
if possible we would like to avoid this for security reasons.
Yeah, that's no solution. Even if it were OK for dory, end users wouldn't
necessarily want to run PG that way.
regards, tom lane
I wrote:
The solution I was thinking about last night was to have
PGSharedMemoryReAttach call MapViewOfFileEx to map the shared memory
segment at an unspecified address, then unmap it, then call VirtualFree,
and finally call MapViewOfFileEx with the real target address. The idea
here is to get these various DLLs to set up any memory allocation pools
they're going to set up before we risk doing VirtualFree. I am not,
at this point, convinced this will fix it :-( ... but I'm not sure what
else to try.
So the answer is that that doesn't help at all.
It's clear from dory's results that something is causing a 4MB chunk
of memory to get reserved in the process's address space, sometimes.
It might happen during the main MapViewOfFileEx call, or during the
preceding VirtualFree, or with my map/unmap dance in place, it might
happen during that. Frequently it doesn't happen at all, at least not
before the point where we've successfully done MapViewOfFileEx. But
if it does happen, and the chunk happens to get put in a spot that
overlaps where we want to put the shmem block, kaboom.
What seems like a plausible theory at this point is that the apparent
asynchronicity is due to the allocation being triggered by a different
thread, and the fact that our added monitoring code seems to make the
failure more likely can be explained by that code changing the timing.
But what thread could it be? It doesn't really look to me like either
the signal thread or the timer thread could eat 4MB. syslogger.c
also spawns a thread, on Windows, but AFAICS that's not being used in
this test configuration. Maybe the reason dory is showing the problem
is something or other is spawning a thread we don't even know about?
I'm going to go put a 1-sec sleep into the beginning of
PGSharedMemoryReAttach and see if that changes anything. If I'm right
that this is being triggered by another thread, that should allow the
other thread to do its thing (at least most of the time) so that the
failure rate ought to go way down.
Even if that does happen, I'm at a loss for a reasonable way to fix it
for real. Is there a way to seize control of a Windows process so that
there are no other running threads? Any other ideas?
regards, tom lane
Hi,
On 2018-04-30 20:01:40 -0400, Tom Lane wrote:
What seems like a plausible theory at this point is that the apparent
asynchronicity is due to the allocation being triggered by a different
thread, and the fact that our added monitoring code seems to make the
failure more likely can be explained by that code changing the timing.
But what thread could it be? It doesn't really look to me like either
the signal thread or the timer thread could eat 4MB.
It seems plausible that the underlying allocator allocates larger chunks
to serve small allocations. But we don't seem to have started any threads
at PGSharedMemoryReAttach() time? So it'd have to be something else that
starts threads.
Heath, could you use Process Explorer or such to check which threads
are running inside a working backend process?
Greetings,
Andres Freund
On Mon, Apr 30, 2018 at 08:01:40PM -0400, Tom Lane wrote:
It's clear from dory's results that something is causing a 4MB chunk
of memory to get reserved in the process's address space, sometimes.
It might happen during the main MapViewOfFileEx call, or during the
preceding VirtualFree, or with my map/unmap dance in place, it might
happen during that. Frequently it doesn't happen at all, at least not
before the point where we've successfully done MapViewOfFileEx. But
if it does happen, and the chunk happens to get put in a spot that
overlaps where we want to put the shmem block, kaboom.
What seems like a plausible theory at this point is that the apparent
asynchronicity is due to the allocation being triggered by a different
thread, and the fact that our added monitoring code seems to make the
failure more likely can be explained by that code changing the timing.
But what thread could it be? It doesn't really look to me like either
the signal thread or the timer thread could eat 4MB. syslogger.c
also spawns a thread, on Windows, but AFAICS that's not being used in
this test configuration. Maybe the reason dory is showing the problem
is something or other is spawning a thread we don't even know about?
Likely some privileged daemon is creating a thread in every new process. (On
Windows, it's not unusual for one process to create a thread in another
process.) We don't have good control over that.
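(For the non-Windows folks: the API for that is a stock part of the
system. Schematically, with 'pid' and 'routine' standing in for whatever
the injecting agent wants, it can do

    HANDLE proc = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);
    HANDLE thread = CreateRemoteThread(proc, NULL, 0, routine,
                                       NULL, 0, NULL);

and the new thread's stack alone reserves address space in the target
process.)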
I'm at a loss for a reasonable way to fix it
for real. Is there a way to seize control of a Windows process so that
there are no other running threads?
I think not.
Any other ideas?
PostgreSQL could retry the whole process creation, analogous to
internal_forkexec() retries. Have the failed process exit after recording the
fact that it couldn't attach. Make the postmaster notice and spawn a
replacement. Give up after 100 failed attempts.
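Roughly this shape, with the helper names invented:

    /* Hand-wavy sketch of the retry idea. */
    for (attempt = 0; attempt < 100; attempt++)
    {
        pid = spawn_backend_process();      /* CreateProcess + reattach */
        if (!child_flagged_reattach_failure(pid))
            break;                          /* normal startup; done */
        /* child exited after recording a reattach failure; try again */
    }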
Noah Misch <noah@leadboat.com> writes:
On Mon, Apr 30, 2018 at 08:01:40PM -0400, Tom Lane wrote:
What seems like a plausible theory at this point is that the apparent
asynchronicity is due to the allocation being triggered by a different
thread, and the fact that our added monitoring code seems to make the
failure more likely can be explained by that code changing the timing.
But what thread could it be? It doesn't really look to me like either
the signal thread or the timer thread could eat 4MB. syslogger.c
also spawns a thread, on Windows, but AFAICS that's not being used in
this test configuration.
The 1-second wait doesn't seem to have changed things much, which puts
a hole in the idea that this is triggered by a thread spawned earlier
in backend process startup.
Maybe the reason dory is showing the problem
is something or other is spawning a thread we don't even know about?
Likely some privileged daemon is creating a thread in every new process.
Yeah, I'm afraid that's the most likely theory at this point; it offers
an explanation why we're seeing this on dory and not other machines.
Although if the daemon were responding to process startup, wouldn't the
extra wait have given it time to do so? There's still something that
doesn't add up here.
Any other ideas?
PostgreSQL could retry the whole process creation, analogous to
internal_forkexec() retries.
In the absence of any clearer theory about what's causing this,
that may be our only recourse. Sure is ugly though.
regards, tom lane
On Tue, May 1, 2018 at 2:59 PM, Noah Misch <noah@leadboat.com> wrote:
Likely some privileged daemon is creating a thread in every new process. (On
Windows, it's not unusual for one process to create a thread in another
process.) We don't have good control over that.
Huh. I was already amazed (as a non-Windows user) by the DSM code
that duplicates file handles into the postmaster process without its
cooperation, but starting threads is even more amazing.
Apparently debuggers do that. Could this be running in some kind of
debugger-managed environment or build, perhaps as a result of some
core dump capturing mode or something?
https://msdn.microsoft.com/en-us/library/windows/desktop/dd405484(v=vs.85).aspx
Apparently another way to mess with another process's memory map is
via "Asynchronous Procedure Calls":
http://blogs.microsoft.co.il/pavely/2017/03/14/injecting-a-dll-without-a-remote-thread/
It looks like that mechanism could allow something either in our own
process (perhaps some timer-related thing that we might have set up
ourselves or might be set up by the system?) or another process to
queue actions for our own thread to run at certain points.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms681951(v=vs.85).aspx
--
Thomas Munro
http://www.enterprisedb.com
Andres Freund <andres@anarazel.de> writes:
Heath, could you use Process Explorer or such to check which threads
are running inside a working backend process?
It seems to be possible to enumerate the threads that are present inside a
Windows process, although it's not clear to me how much identifying info
is available. Perhaps it'd be worth putting in some "dump threads"
debugging code like the "dump modules" code we had in there for a bit?
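Toolhelp can do the enumeration; a possible shape for it (untested, and
note that, like the module dump, taking the snapshot could itself
perturb the timing):

    #include <tlhelp32.h>

    /* Untested sketch: log the thread ids present in this process. */
    static void
    dump_threads(void)
    {
        HANDLE          snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
        THREADENTRY32   te;
        DWORD           mypid = GetCurrentProcessId();

        if (snap == INVALID_HANDLE_VALUE)
            return;
        te.dwSize = sizeof(te);
        if (Thread32First(snap, &te))
        {
            do
            {
                if (te.th32OwnerProcessID == mypid)
                    elog(LOG, "thread id %lu",
                         (unsigned long) te.th32ThreadID);
            } while (Thread32Next(snap, &te));
        }
        CloseHandle(snap);
    }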
regards, tom lane