"could not reattach to shared memory" on buildfarm member dory
So far, dory has failed three times with essentially identical symptoms:
2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared memory (key=0000000000000190, addr=00000000018E0000): error code 487
2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel worker
2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from tenk1 group by twenty;
2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel worker" (PID 2240) exited with exit code 1
Now how can this be? We've successfully reserved and released the address
range we want to use, so it *should* be free at the instant we try to map.
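For reference, the sequence in question looks roughly like this (a
simplified sketch of pgwin32_ReserveSharedMemoryRegion() and
PGSharedMemoryReAttach(), variable names abbreviated, error handling
omitted):

    /* In the postmaster, right after CreateProcess(): reserve the
     * target range in the child so nothing grabs it early. */
    VirtualAllocEx(childProcess, UsedShmemSegAddr, UsedShmemSegSize,
                   MEM_RESERVE, PAGE_READWRITE);

    /* In the child, at reattach time: release the reservation ... */
    VirtualFree(UsedShmemSegAddr, 0, MEM_RELEASE);

    /* ... and immediately map the real segment at the same address.
     * Error code 487 (ERROR_INVALID_ADDRESS) means something took
     * part of the range in the window between these two calls. */
    hdr = MapViewOfFileEx(UsedShmemSegID, FILE_MAP_READ | FILE_MAP_WRITE,
                          0, 0, 0, UsedShmemSegAddr);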
Another thing that seems curious, though it may just be an artifact of
not having many data points yet, is that these failures all occurred
during pg_upgradecheck. You'd think the "check" and "install-check"
steps would be equally vulnerable to the failure.
I guess the good news is that we're seeing this in a reasonably
reproducible fashion, so there's some hope of digging down to find
out the actual cause.
regards, tom lane
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
So far, dory has failed three times with essentially identical symptoms:
2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared memory (key=0000000000000190, addr=00000000018E0000): error code 487
2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel worker
2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from tenk1 group by twenty;
2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel worker" (PID 2240) exited with exit code 1
Now how can this be? We've successfully reserved and released the address
range we want to use, so it *should* be free at the instant we try to map.
Yeah, that's definitely interesting.
I guess the good news is that we're seeing this in a reasonably
reproducible fashion, so there's some hope of digging down to find
out the actual cause.
I've asked Heath to take a look at the system again and see if there's
any Windows logs or such that might help us understand what's happening.
AV was disabled on the box, so don't think it's that, at least.
Thanks!
Stephen
On Tue, Apr 24, 2018 at 11:18 AM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
So far, dory has failed three times with essentially identical symptoms:
2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared memory (key=0000000000000190, addr=00000000018E0000): error code 487
2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel worker
2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from tenk1 group by twenty;
2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel worker" (PID 2240) exited with exit code 1
Now how can this be? We've successfully reserved and released the address
range we want to use, so it *should* be free at the instant we try to map.
Yeah, that's definitely interesting.
I wondered if another thread with the right timing could map something
between the VirtualFree() and MapViewOfFileEx() calls, but we don't
create the Windows signal handling thread until a bit later. Could
there be any other threads active in the process?
Maybe try asking what's mapped there with VirtualQueryEx() on failure?
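Something along these lines might do it (an untested sketch, to be run
only in the failure path):

    /* Untested: ask Windows what currently occupies the target address. */
    MEMORY_BASIC_INFORMATION mbi;

    if (VirtualQueryEx(GetCurrentProcess(), UsedShmemSegAddr,
                       &mbi, sizeof(mbi)) != 0)
        elog(LOG, "%p: base %p, size %zu, state 0x%lx, type 0x%lx",
             UsedShmemSegAddr, mbi.BaseAddress, mbi.RegionSize,
             mbi.State, mbi.Type);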
--
Thomas Munro
http://www.enterprisedb.com
On Tue, Apr 24, 2018 at 11:37:33AM +1200, Thomas Munro wrote:
On Tue, Apr 24, 2018 at 11:18 AM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
So far, dory has failed three times with essentially identical symptoms:
2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared memory (key=0000000000000190, addr=00000000018E0000): error code 487
2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel worker
2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from tenk1 group by twenty;
2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel worker" (PID 2240) exited with exit code 1
Now how can this be? We've successfully reserved and released the address
range we want to use, so it *should* be free at the instant we try to map.
Yeah, that's definitely interesting.
I wondered if another thread with the right timing could map something
between the VirtualFree() and MapViewOfFileEx() calls, but we don't
create the Windows signal handling thread until a bit later. Could
there be any other threads active in the process?
Maybe try asking what's mapped there with VirtualQueryEx() on failure?
+1. An implementation of that:
/messages/by-id/20170403065106.GA2624300@tornado.leadboat.com
On Tue, Apr 24, 2018 at 1:18 AM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
So far, dory has failed three times with essentially identical symptoms:
2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared
memory (key=0000000000000190, addr=00000000018E0000): error code 487
2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel
worker
2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from
tenk1 group by twenty;
2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel
worker" (PID 2240) exited with exit code 1
Now how can this be? We've successfully reserved and released the
address
range we want to use, so it *should* be free at the instant we try to
map.
Yeah, that's definitely interesting.
I guess the good news is that we're seeing this in a reasonably
reproducible fashion, so there's some hope of digging down to find
out the actual cause.
I've asked Heath to take a look at the system again and see if there's
any Windows logs or such that might help us understand what's happening.
AV was disabled on the box, so don't think it's that, at least.
Disabled or uninstalled?
Back when I was combating Windows AV on a daily basis, this normally did
not have the same effect. Just disabling the AV didn't actually remove
the parts that caused issues; it just hid them. An actual uninstall is
what was required.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On 24 April 2018 at 15:18, Magnus Hagander <magnus@hagander.net> wrote:
Back when I was combating Windows AV on a daily basis, this normally did not
have the same effect. Just disabling the AV didn't actually remove the parts
that caused issues; it just hid them. An actual uninstall is what was required.
Yep. Specifically, they tended to inject kernel hooks and/or load hook
DLLs that did funky and often flakey things. Often with poor awareness
of things like multiple processes opening one file for write at the
same time.
I think I heard that MS has cleaned up the situation with AV
considerably by offering more kernel infrastructure for it, and
restricting what you can do in terms of kernel extensions etc. But I
don't know how much.
In any case, do you think dropping a minidump at the point of failure
might be informative? It should contain the full memory mapping
information. For this purpose we could just create a crashdumps/
directory then abort() when we detect the error, and have the
buildfarm stop processing until someone can grab the tempdir with the
dumps, binaries, .pdb files, etc.
src/backend/port/win32/crashdump.c doesn't expose a helper function to
do all the dbghelp.dll messing around and create a crashdump; it only
allows that to be done via a crash handler. But it might make sense to
break out the actual "write a crash dump" part to a separately
callable function. I've looked at doing this before, but always got
stuck with the apparent lack of support in gdb or lldb to be used as a
library for self-dumping. You can always shell out to gcore I guess...
but ew. Or we can fork() and abort() the forked child like
https://github.com/RuntimeTools/gencore does, but again, ew.
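The core of such a helper would be small, something like this untested
sketch (the function name is invented, and real code should keep loading
dbghelp.dll dynamically the way crashdump.c already does):

    #include <windows.h>
    #include <dbghelp.h>            /* link with dbghelp.lib */

    /* Untested sketch: write a minidump of the current process. */
    static BOOL
    write_self_minidump(const char *path)
    {
        HANDLE  file = CreateFileA(path, GENERIC_WRITE, 0, NULL,
                                   CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        BOOL    ok;

        if (file == INVALID_HANDLE_VALUE)
            return FALSE;
        /* MiniDumpWithFullMemoryInfo records the region maps we care about */
        ok = MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(),
                               file, MiniDumpWithFullMemoryInfo,
                               NULL, NULL, NULL);
        CloseHandle(file);
        return ok;
    }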
I was thinking that maybe the buildfarm could just create crashdumps/
automatically, but then we'd need to have support infrastructure for
recording the Pg binaries and .pdb files along with the dumps,
rotating them so we don't run out of space, etc etc.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Greetings,
* Magnus Hagander (magnus@hagander.net) wrote:
On Tue, Apr 24, 2018 at 1:18 AM, Stephen Frost <sfrost@snowman.net> wrote:
I've asked Heath to take a look at the system again and see if there's
any Windows logs or such that might help us understand what's happening.
AV was disabled on the box, so don't think it's that, at least.
Disabled or uninstalled?
The only AV installed on the system is the "Windows Defender" thing, so
it's not some additional AV system.
Thanks!
Stephen
Noah Misch <noah@leadboat.com> writes:
On Tue, Apr 24, 2018 at 11:37:33AM +1200, Thomas Munro wrote:
Maybe try asking what's mapped there with VirtualQueryEx() on failure?
+1. An implementation of that:
/messages/by-id/20170403065106.GA2624300@tornado.leadboat.com
Not seeing any other work happening here, I pushed a little bit of
quick-hack investigation code. This is based on noting that
VirtualAllocEx is documented as rounding the allocation up to a page
boundary (4K), but there's nothing specific about whether or how much
CreateFileMapping or MapViewOfFileEx might round up. The observed
failures could be explained if those guys might eat more virtual
address space for the same request size as VirtualAllocEx does.
This is a stretch, for sure, but given the lack of any other theories
we might as well check it.
regards, tom lane
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Noah Misch <noah@leadboat.com> writes:
On Tue, Apr 24, 2018 at 11:37:33AM +1200, Thomas Munro wrote:
Maybe try asking what's mapped there with VirtualQueryEx() on failure?
+1. An implementation of that:
/messages/by-id/20170403065106.GA2624300@tornado.leadboat.com
Not seeing any other work happening here, I pushed a little bit of
quick-hack investigation code. This is based on noting that
VirtualAllocEx is documented as rounding the allocation up to a page
boundary (4K), but there's nothing specific about whether or how much
CreateFileMapping or MapViewOfFileEx might round up. The observed
failures could be explained if those guys might eat more virtual
address space for the same request size as VirtualAllocEx does.
This is a stretch, for sure, but given the lack of any other theories
we might as well check it.
Sounds good to me. Just as an FYI, there are a couple of folks taking a
look at the system and trying to figure out what's going on. We've seen
an Event ID 1530 error in the Windows Event log associated with
vctip.exe, which Visual Studio was running as part of the build, but only
sometimes. When vctip.exe runs and then finishes, it goes and cleans
things up, which seems to be what's triggering the 1530, and that appears
to correlate with the failures, but it's hard to say whether that's
really a smoking gun or just coincidence.
Thanks!
Stephen
[ Thanks to Stephen for cranking up a continuous build loop on dory ]
Noah Misch <noah@leadboat.com> writes:
On Tue, Apr 24, 2018 at 11:37:33AM +1200, Thomas Munro wrote:
Maybe try asking what's mapped there with VirtualQueryEx() on failure?
+1. An implementation of that:
/messages/by-id/20170403065106.GA2624300@tornado.leadboat.com
So I tried putting in that code, and it turns the problem from something
that maybe happens in every third buildfarm run or so, to something that
happens at least a dozen times in a single "make check" step. This seems
to mean that either EnumProcessModules or GetModuleFileNameEx is itself
allocating memory, and sometimes that allocation comes out of the space
VirtualFree just freed :-(.
So we can't use those functions. We have however proven that no new
module gets loaded during VirtualFree or MapViewOfFileEx, so there
doesn't seem to be anything more to be learned from them anyway.
What it looks like to me is that MapViewOfFileEx allocates some memory and
sometimes that comes out of the wrong place. This is, um, unfortunate.
It also appears that VirtualFree might sometimes allocate some memory,
and that'd be even more unfortunate, but it's hard to be certain; the
blame might well fall on VirtualQuery instead. (Ain't Heisenbugs fun?)
The solution I was thinking about last night was to have
PGSharedMemoryReAttach call MapViewOfFileEx to map the shared memory
segment at an unspecified address, then unmap it, then call VirtualFree,
and finally call MapViewOfFileEx with the real target address. The idea
here is to get these various DLLs to set up any memory allocation pools
they're going to set up before we risk doing VirtualFree. I am not,
at this point, convinced this will fix it :-( ... but I'm not sure what
else to try.
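In code form, the proposed ordering would be roughly this (sketch only,
error handling omitted):

    /* Map the segment at any address Windows likes, to coax the DLLs
     * into grabbing whatever pool space they're going to grab ... */
    void *probe = MapViewOfFileEx(UsedShmemSegID,
                                  FILE_MAP_READ | FILE_MAP_WRITE,
                                  0, 0, 0, NULL);

    UnmapViewOfFile(probe);

    /* ... then give up the reservation and map at the real address. */
    VirtualFree(UsedShmemSegAddr, 0, MEM_RELEASE);
    hdr = MapViewOfFileEx(UsedShmemSegID, FILE_MAP_READ | FILE_MAP_WRITE,
                          0, 0, 0, UsedShmemSegAddr);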
In any case, it's still pretty unclear why dory is showing this problem
and other buildfarm members are not. whelk for instance seems to be
loading all the same DLLs and more besides.
regards, tom lane
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
[ Thanks to Stephen for cranking up a continuous build loop on dory ]
That was actually Heath, who is also trying to work this issue and has
an idea about something else which might help (and some more information
about what's happening in the event log). Adding him to the thread so
he can (easily) reply with what he's found.
Heath..?
Thanks!
Stephen
On Mon, Apr 30, 2018 at 2:34 PM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
[ Thanks to Stephen for cranking up a continuous build loop on dory ]
That was actually Heath, who is also trying to work this issue and has
an idea about something else which might help (and some more information
about what's happening in the event log). Adding him to the thread so
he can (easily) reply with what he's found.
Heath..?
Thanks!
Stephen
So what I noticed after adding the '--force' flag was that in the Event
Viewer logs there was an Error in the System log stating that "The
application-specific permission settings do not grant Local Activation
permission for the COM Server application" for the Runtime Broker. So
around 2:00 pm today I went and changed the ownership of the registry
values to Administrators so I could add the user we are running the builds
under to the list of members that have access to it. However, right after
I made that change the build was actually broken for me, so I am just now
turning the builds back on to run constantly to verify whether this has
any effect on the issue of not being able to reattach to the shared
memory. I am hoping that this makes things more stable; however, I am not
confident that these are related.
The change that I attempted prior to this, which was done last Wednesday,
was to stop the vctip.exe program from being used, as this was causing
issues for us as well. It was causing an Event ID 1530 error stating
that "The application that is listed in the event details is leaving the
registry handle open", and Windows was helpfully closing any registry
handles for that user profile; I thought that possibly, when doing so, it
was also cleaning up memory that was allocated prior. However, that
change did not make any difference for us.
Any ideas or changes that we could try to help debug or verify would be
helpful. We have considered changing it to run everything as SYSTEM, but
if possible we would like to avoid this for security reasons. Thank you
in advance, and I appreciate all the help.
-Heath
Greetings,
* Heath Lord (heath.lord@crunchydata.com) wrote:
Any ideas or changes that we could try to help debug or verify would be
helpful. We have considered changing it to run everything as SYSTEM, but
if possible we would like to avoid this for security reasons. Thank you
in advance, and I appreciate all the help.
Just to be clear: there is no longer anything showing up in the Event
Viewer associated with running the builds. There may still be an issue
with the system setup or such, but that at least seems less likely to be
the cause, so I'm thinking that Tom is more likely correct that PG is
doing something not quite right here.
Thanks!
Stephen
Heath Lord <heath.lord@crunchydata.com> writes:
So what I noticed after adding the '--force' flag was that in the Event
Viewer logs there was an Error in the System log stating that "The
application-specific permission settings do not grant Local Activation
permission for the COM Server application" for the Runtime Broker. So
around 2:00 pm today I went and changed the ownership of the registry
values to Administrators so I could add the user we are running the builds
under to the list of members that have access to it. However, right after
I made that change the build was actually broken for me, so I am just now
turning the builds back on to run constantly to verify whether this has
any effect on the issue of not being able to reattach to the shared
memory. I am hoping that this makes things more stable; however, I am not
confident that these are related.
The build was broken on Windows between about 12:30 and 14:30 EDT,
thanks to an unrelated change. Now that that's sorted, dory is
still failing :-(.
Moreover, even though I took out the module dump logic, the failure rate
is still a good bit higher than it was before, which seems to confirm my
fear that this is a Heisenbug: either VirtualQuery itself, or the act of
elog'ing a bunch of output, is causing memory allocations to take place
that otherwise would not have.
I'm hoping that the elog output is to blame for that, and am going to
go try to rejigger the code so that we capture the memory maps into space
that was allocated before VirtualFree.
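Something like this untested sketch: walk the address space into a
static buffer during the critical window, and only elog() the results
once we're past it:

    #define MAX_REGIONS 1024
    static MEMORY_BASIC_INFORMATION regions[MAX_REGIONS];  /* preallocated */

    /* Untested: snapshot the maps without calling any allocator. */
    static int
    capture_memory_map(void)
    {
        char   *addr = NULL;
        int     n = 0;

        while (n < MAX_REGIONS &&
               VirtualQuery(addr, &regions[n], sizeof(regions[n])) != 0)
        {
            addr = (char *) regions[n].BaseAddress + regions[n].RegionSize;
            n++;
        }
        return n;           /* print with elog() later, outside the window */
    }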
Any ideas or changes that we could try to help debug or verify would be
helpful. We have considered changing it to run everything as SYSTEM, but
if possible we would like to avoid this for security reasons.
Yeah, that's no solution. Even if it were OK for dory, end users wouldn't
necessarily want to run PG that way.
regards, tom lane
I wrote:
The solution I was thinking about last night was to have
PGSharedMemoryReAttach call MapViewOfFileEx to map the shared memory
segment at an unspecified address, then unmap it, then call VirtualFree,
and finally call MapViewOfFileEx with the real target address. The idea
here is to get these various DLLs to set up any memory allocation pools
they're going to set up before we risk doing VirtualFree. I am not,
at this point, convinced this will fix it :-( ... but I'm not sure what
else to try.
So the answer is that that doesn't help at all.
It's clear from dory's results that something is causing a 4MB chunk
of memory to get reserved in the process's address space, sometimes.
It might happen during the main MapViewOfFileEx call, or during the
preceding VirtualFree, or with my map/unmap dance in place, it might
happen during that. Frequently it doesn't happen at all, at least not
before the point where we've successfully done MapViewOfFileEx. But
if it does happen, and the chunk happens to get put in a spot that
overlaps where we want to put the shmem block, kaboom.
What seems like a plausible theory at this point is that the apparent
asynchronicity is due to the allocation being triggered by a different
thread, and the fact that our added monitoring code seems to make the
failure more likely can be explained by that code changing the timing.
But what thread could it be? It doesn't really look to me like either
the signal thread or the timer thread could eat 4MB. syslogger.c
also spawns a thread, on Windows, but AFAICS that's not being used in
this test configuration. Maybe the reason dory is showing the problem
is something or other is spawning a thread we don't even know about?
I'm going to go put a 1-sec sleep into the beginning of
PGSharedMemoryReAttach and see if that changes anything. If I'm right
that this is being triggered by another thread, that should allow the
other thread to do its thing (at least most of the time) so that the
failure rate ought to go way down.
Even if that does happen, I'm at a loss for a reasonable way to fix it
for real. Is there a way to seize control of a Windows process so that
there are no other running threads? Any other ideas?
regards, tom lane
Hi,
On 2018-04-30 20:01:40 -0400, Tom Lane wrote:
What seems like a plausible theory at this point is that the apparent
asynchronicity is due to the allocation being triggered by a different
thread, and the fact that our added monitoring code seems to make the
failure more likely can be explained by that code changing the timing.
But what thread could it be? It doesn't really look to me like either
the signal thread or the timer thread could eat 4MB.
It seems plausible that the underlying allocator allocates larger chunks
to serve small allocations. But we don't seem to have started any threads
at PGSharedMemoryReAttach() time? So it'd have to be something else that
starts threads.
Heath, could you use Process Explorer or such to check which threads
are running inside a working backend process?
Greetings,
Andres Freund
On Mon, Apr 30, 2018 at 08:01:40PM -0400, Tom Lane wrote:
It's clear from dory's results that something is causing a 4MB chunk
of memory to get reserved in the process's address space, sometimes.
It might happen during the main MapViewOfFileEx call, or during the
preceding VirtualFree, or with my map/unmap dance in place, it might
happen during that. Frequently it doesn't happen at all, at least not
before the point where we've successfully done MapViewOfFileEx. But
if it does happen, and the chunk happens to get put in a spot that
overlaps where we want to put the shmem block, kaboom.
What seems like a plausible theory at this point is that the apparent
asynchronicity is due to the allocation being triggered by a different
thread, and the fact that our added monitoring code seems to make the
failure more likely can be explained by that code changing the timing.
But what thread could it be? It doesn't really look to me like either
the signal thread or the timer thread could eat 4MB. syslogger.c
also spawns a thread, on Windows, but AFAICS that's not being used in
this test configuration. Maybe the reason dory is showing the problem
is something or other is spawning a thread we don't even know about?
Likely some privileged daemon is creating a thread in every new process. (On
Windows, it's not unusual for one process to create a thread in another
process.) We don't have good control over that.
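(For the non-Windows folks: the API for that is a stock part of the
system. Schematically, with 'pid' and 'routine' standing in for whatever
the injecting agent wants, it can do

    HANDLE proc = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);
    HANDLE thread = CreateRemoteThread(proc, NULL, 0, routine,
                                       NULL, 0, NULL);

and the new thread's stack alone reserves address space in the target
process.)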
I'm at a loss for a reasonable way to fix it
for real. Is there a way to seize control of a Windows process so that
there are no other running threads?
I think not.
Any other ideas?
PostgreSQL could retry the whole process creation, analogous to
internal_forkexec() retries. Have the failed process exit after recording the
fact that it couldn't attach. Make the postmaster notice and spawn a
replacement. Give up after 100 failed attempts.
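Roughly this shape, with the helper names invented:

    /* Hand-wavy sketch of the retry idea. */
    for (attempt = 0; attempt < 100; attempt++)
    {
        pid = spawn_backend_process();      /* CreateProcess + reattach */
        if (!child_flagged_reattach_failure(pid))
            break;                          /* normal startup; done */
        /* child exited after recording a reattach failure; try again */
    }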
Noah Misch <noah@leadboat.com> writes:
On Mon, Apr 30, 2018 at 08:01:40PM -0400, Tom Lane wrote:
What seems like a plausible theory at this point is that the apparent
asynchronicity is due to the allocation being triggered by a different
thread, and the fact that our added monitoring code seems to make the
failure more likely can be explained by that code changing the timing.
But what thread could it be? It doesn't really look to me like either
the signal thread or the timer thread could eat 4MB. syslogger.c
also spawns a thread, on Windows, but AFAICS that's not being used in
this test configuration.
The 1-second wait doesn't seem to have changed things much, which puts
a hole in the idea that this is triggered by a thread spawned earlier
in backend process startup.
Maybe the reason dory is showing the problem
is something or other is spawning a thread we don't even know about?
Likely some privileged daemon is creating a thread in every new process.
Yeah, I'm afraid that's the most likely theory at this point; it offers
an explanation why we're seeing this on dory and not other machines.
Although if the daemon were responding to process startup, wouldn't the
extra wait have given it time to do so? There's still something that
doesn't add up here.
Any other ideas?
PostgreSQL could retry the whole process creation, analogous to
internal_forkexec() retries.
In the absence of any clearer theory about what's causing this,
that may be our only recourse. Sure is ugly though.
regards, tom lane
On Tue, May 1, 2018 at 2:59 PM, Noah Misch <noah@leadboat.com> wrote:
Likely some privileged daemon is creating a thread in every new process. (On
Windows, it's not unusual for one process to create a thread in another
process.) We don't have good control over that.
Huh. I was already amazed (as a non-Windows user) by the DSM code
that duplicates file handles into the postmaster process without its
cooperation, but starting threads is even more amazing.
Apparently debuggers do that. Could this be running in some kind of
debugger-managed environment or build, perhaps as a result of some
core dump capturing mode or something?
https://msdn.microsoft.com/en-us/library/windows/desktop/dd405484(v=vs.85).aspx
Apparently another way to mess with another process's memory map is
via "Asynchronous Procedure Calls":
http://blogs.microsoft.co.il/pavely/2017/03/14/injecting-a-dll-without-a-remote-thread/
It looks like that mechanism could allow something either in our own
process (perhaps some timer-related thing that we might have set up
ourselves or might be set up by the system?) or another process to
queue actions for our own thread to run at certain points.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms681951(v=vs.85).aspx
--
Thomas Munro
http://www.enterprisedb.com
Andres Freund <andres@anarazel.de> writes:
Heath, could you use Process Explorer or such to check which threads
are running inside a working backend process?
It seems to be possible to enumerate the threads that are present inside a
Windows process, although it's not clear to me how much identifying info
is available. Perhaps it'd be worth putting in some "dump threads"
debugging code like the "dump modules" code we had in there for a bit?
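Toolhelp can do the enumeration; a possible shape for it (untested, and
note that, like the module dump, taking the snapshot could itself
perturb the timing):

    #include <tlhelp32.h>

    /* Untested sketch: log the thread ids present in this process. */
    static void
    dump_threads(void)
    {
        HANDLE          snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
        THREADENTRY32   te;
        DWORD           mypid = GetCurrentProcessId();

        if (snap == INVALID_HANDLE_VALUE)
            return;
        te.dwSize = sizeof(te);
        if (Thread32First(snap, &te))
        {
            do
            {
                if (te.th32OwnerProcessID == mypid)
                    elog(LOG, "thread id %lu",
                         (unsigned long) te.th32ThreadID);
            } while (Thread32Next(snap, &te));
        }
        CloseHandle(snap);
    }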
regards, tom lane