BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Started by PG Bug reporting formalmost 7 years ago40 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 15804
Logged by: Yulian Khodorkovskiy
Email address: yuli.khodorkovskiy@crunchydata.com
PostgreSQL version: Unsupported/Unknown
Operating system: Centos 7.4
Description:

The following assertion fails when compiling postgres 12 on Linux (centos
7.4) with EXEC_BACKEND and logging_collector enabled:

`Assert(UsedShmemSegAddr != NULL);` in `PGSharedMemoryNoReAttach()`

Commit 57431a911d3a650451d198846ad3194900666152 appears to have introduced
this regression by moving SysLogger_Start() before reset_shared() is called
and shared memory is initialized.

For what it's worth, Windows 10/jacana (and maybe other windows builds) on
the build farm does not use logging_collector, which is perhaps why this
regression was not caught.

Yuli

#2Yuli Khodorkovskiy
yuli.khodorkovskiy@crunchydata.com
In reply to: PG Bug reporting form (#1)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

On May 14, 2019, at 2:28 PM, PG Bug reporting form <noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 15804
Logged by: Yulian Khodorkovskiy
Email address: yuli.khodorkovskiy@crunchydata.com
PostgreSQL version: Unsupported/Unknown
Operating system: Centos 7.4
Description:

The following assertion fails when compiling postgres 12 on Linux (centos
7.4) with EXEC_BACKEND and logging_collector enabled:

`Assert(UsedShmemSegAddr != NULL);` in `PGSharedMemoryNoReAttach()`

Commit 57431a911d3a650451d198846ad3194900666152 appears to have introduced
this regression by moving SysLogger_Start() before reset_shared() is called
and shared memory is initialized.

For what it's worth, Windows 10/jacana (and maybe other windows builds) on
the build farm does not use logging_collector, which is perhaps why this
regression was not caught.

Yuli

Attached is a patch that fixes the issue in the bug report.

Attachments:

0001-Initialize-shared-memory-before-calling-SysLogger_St.patchapplication/octet-stream; name=0001-Initialize-shared-memory-before-calling-SysLogger_St.patch; x-unix-mode=0644Download+5-6
#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Yuli Khodorkovskiy (#2)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Yuli Khodorkovskiy <yuli.khodorkovskiy@crunchydata.com> writes:

Attached is a patch that fixes the issue in the bug report.

I do not think this is a very good way to fix it (even assuming
that it works, which I'm unsure of --- in particular, doing this
would alter the order of changes to postmaster.pid, possibly
breaking pg_ctl or other tools that look at that file). The whole
point of commit 57431a911 was to switch to log-collector logging
before we've done anything very interesting, and that would surely
include shmem setup. I'm a bit surprised actually that Peter
didn't move the SysLogger_Start call even further up; in principle
it ought to be safe to do it as soon as we have the data directory
lock file.

It might be better to give up the assertion in PGSharedMemoryNoReAttach,
and just make it work more like PGSharedMemoryDetach, ie "detach if
UsedShmemSegAddr is set, else do nothing". I don't remember for sure,
but if we do that, there might be no functional difference anymore
between those two functions, in which case we might as well merge 'em.

regards, tom lane

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

I wrote:

It might be better to give up the assertion in PGSharedMemoryNoReAttach,
and just make it work more like PGSharedMemoryDetach, ie "detach if
UsedShmemSegAddr is set, else do nothing". I don't remember for sure,
but if we do that, there might be no functional difference anymore
between those two functions, in which case we might as well merge 'em.

Apropos of this: I notice that commit 57431a911 is already relying
on PGSharedMemoryDetach to do the "right thing" here, because that
gets called in the non-EXEC_BACKEND code path, cf SysLogger_Start:

/* Drop our connection to postmaster's shared memory, as well */
dsm_detach_all();
PGSharedMemoryDetach();

I'm also not too happy that this comment doesn't mention the fact that
we'd (now) only be attached to shmem in the case of a postmaster restart.

BTW, it looks to me like dsm_detach_all doesn't get called anywhere in
the EXEC_BACKEND case, which seems like a separate bug.

regards, tom lane

#5Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#3)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

On Tue, May 14, 2019 at 03:52:04PM -0400, Tom Lane wrote:

It might be better to give up the assertion in PGSharedMemoryNoReAttach,
and just make it work more like PGSharedMemoryDetach, ie "detach if
UsedShmemSegAddr is set, else do nothing". I don't remember for sure,
but if we do that, there might be no functional difference anymore
between those two functions, in which case we might as well merge 'em.

It would be nice to fix that before beta1 is out, or we are going to
get complaints :(

So what about dropping PGSharedMemoryNoReAttach() completely and using
(or abusing of?) the detach routine so as we rely on its no-op
behavior when a subprocess is not attached yet to a shared memory
segment. On Windows, the sub-processes calling NoReAttach don't map
to a segment per the comments in the code, and I can indeed see some
"could not map" LOG messages coming from these in the logs if
enforcing the use of the detach routine in postmaster.c. It could
actually be an advantage to unmap using UnmapViewOfFile() so as a
subsequent call of ReAttach does not finish by mapping an extra time.
We may consider lowering the LOG messages to DEBUG1, or extend the
detach routine so as we control the level of logs with an elevel to
avoid sparse LOG messages when unmapping from shared segments where we
know it would fail.

I have been playing with the attached using EXEC_BACKEND on Linux and
some Windows builds FWIW. Thoughts?
--
Michael

Attachments:

shmem-noattach.patchtext/x-diff; charset=us-asciiDownload+3-71
#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#5)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Michael Paquier <michael@paquier.xyz> writes:

On Tue, May 14, 2019 at 03:52:04PM -0400, Tom Lane wrote:

It might be better to give up the assertion in PGSharedMemoryNoReAttach,
and just make it work more like PGSharedMemoryDetach, ie "detach if
UsedShmemSegAddr is set, else do nothing". I don't remember for sure,
but if we do that, there might be no functional difference anymore
between those two functions, in which case we might as well merge 'em.

It would be nice to fix that before beta1 is out, or we are going to
get complaints :(

Agreed, that's why I put it on the open items list ;-)

So what about dropping PGSharedMemoryNoReAttach() completely and using
(or abusing of?) the detach routine so as we rely on its no-op
behavior when a subprocess is not attached yet to a shared memory
segment. On Windows, the sub-processes calling NoReAttach don't map
to a segment per the comments in the code, and I can indeed see some
"could not map" LOG messages coming from these in the logs if
enforcing the use of the detach routine in postmaster.c. It could
actually be an advantage to unmap using UnmapViewOfFile() so as a
subsequent call of ReAttach does not finish by mapping an extra time.
We may consider lowering the LOG messages to DEBUG1, or extend the
detach routine so as we control the level of logs with an elevel to
avoid sparse LOG messages when unmapping from shared segments where we
know it would fail.

Hm. I don't much like ignoring errors, which is what that would
amount to. It seems like in win32_shmem.c, PGSharedMemoryNoReAttach
is already more or less right: resetting UsedShmemSegAddr to NULL and
then calling PGSharedMemoryDetach is exactly what to do. It just needs
to lose the no-longer-correct assertions. Maybe sysv_shmem.c's version
should look the same?

The real point of PGSharedMemoryNoReAttach, now that I think about it,
is that it's dealing with a case where we passed down UsedShmemSegAddr
via the BackendParameters mechanism, but we know that the segment is
not really attached (because of exec()). This is different from the
API expectations of PGSharedMemoryDetach, which thinks that non-nullness
of UsedShmemSegAddr corresponds to whether the segment is attached or
not. So we can't unify those functions without breaking things, or
at least losing the ability to distinguish errors from not-errors.

BTW, another thing I think is worth looking into is exactly why the
buildfarm failed to detect this problem. According to the code
coverage report,

https://coverage.postgresql.org/src/backend/postmaster/syslogger.c.gcov.html

we *are* exercising the syslogger at some point during a standard
test run. Perhaps the Windows animals don't run whatever test
case that is?

regards, tom lane

#7Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#6)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

On Wed, May 15, 2019 at 10:39:49AM -0400, Tom Lane wrote:

Hm. I don't much like ignoring errors, which is what that would
amount to. It seems like in win32_shmem.c, PGSharedMemoryNoReAttach
is already more or less right: resetting UsedShmemSegAddr to NULL and
then calling PGSharedMemoryDetach is exactly what to do. It just needs
to lose the no-longer-correct assertions. Maybe sysv_shmem.c's version
should look the same?

The real point of PGSharedMemoryNoReAttach, now that I think about it,
is that it's dealing with a case where we passed down UsedShmemSegAddr
via the BackendParameters mechanism, but we know that the segment is
not really attached (because of exec()). This is different from the
API expectations of PGSharedMemoryDetach, which thinks that non-nullness
of UsedShmemSegAddr corresponds to whether the segment is attached or
not. So we can't unify those functions without breaking things, or
at least losing the ability to distinguish errors from not-errors.

My thoughts were to switch the LOG to DEBUG1 for the code paths where
we do not expect a failure, which is the code path from the
postmaster, so that actually comes closer to your upthread suggestion
of merging both functions. The only thing which changes is that we
would call UnmapViewOfFile(), and fail silently. We don't generate an
ERROR now for the existing code, just a LOG. So there is no actual
hard failure now? Anyway, you are right that it is no time to
redesign this interface with the current timing.

Using vcregress check with logging_collector and assertions enabled
is enough to see the failures on Windows, but the assertions actually
triggered are not the same ones as the EXEC_BACKEND case with sysv
because it's actually the three assertions in
pgwin32_ReserveSharedMemoryRegion() which blow up, not the ones in
PGSharedMemoryNoReAttach() which is able to handle its own. Removing
some assertions are per the attached avoids problems. I am able to
pass check with the logging collector enabled on Windows.

BTW, another thing I think is worth looking into is exactly why the
buildfarm failed to detect this problem. According to the code
coverage report,

https://coverage.postgresql.org/src/backend/postmaster/syslogger.c.gcov.html

we *are* exercising the syslogger at some point during a standard
test run. Perhaps the Windows animals don't run whatever test
case that is?

Now from what I can see from the related buildfarm members:
- culicidae uses EXEC_BACKEND on Linux and has assertions, but not
logging_collector.
- dory, hamerkop, woodloose and bowerbird test on Windows with MSVC,
have assertions enabled, but not logging_collector.
- jacana tests on Windows with mingw, has assertions, but it does not
enforce logging_collector.
The Windows animals do not run bincheck.

So my bet is that the coverage is from pg_ctl/t/004_logrotate.pl,
which is a test skipped on Windows. culicidae runs that, so the fact
that the failure got undetected is actually a bit of a mystery because
this uses sysv_shmem.c.

Except culicidae, I am seeing no members using EXEC_BACKEND which
enable the syslogger and use TAP tests :(
--
Michael

Attachments:

shmem-remove-asserts.patchtext/x-diff; charset=us-asciiDownload+0-5
#8Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#7)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

On Thu, May 16, 2019 at 10:38:08AM +0900, Michael Paquier wrote:

So my bet is that the coverage is from pg_ctl/t/004_logrotate.pl,
which is a test skipped on Windows. culicidae runs that, so the fact
that the failure got undetected is actually a bit of a mystery because
this uses sysv_shmem.c.

Except culicidae, I am seeing no members using EXEC_BACKEND which
enable the syslogger and use TAP tests :(

Oh, actually culicidae catches the failure, but that's burried in the
logs, and I am not able to catch up that on my machine in the TAP
tests:
TRAP: FailedAssertion("!(UsedShmemSegAddr != ((void *)0))", File:
"pg_shmem.c", Line: 848)
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=culicidae&amp;dt=2019-05-16%2000%3A30%3A04&amp;stg=pg_ctl-check

We definitely need to improve that.

Also, I am noticing another consequence as the handling of backend
variable files also suffers consequences:
could not open backend variables file
"pgsql_tmp/pgsql_tmp.backend_var.21912.1": No such file or directory

So it seems that reverting 57431a91 is an option on the table?
Opinions?
--
Michael

#9Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#8)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Hi,

On 2019-05-16 11:19:03 +0900, Michael Paquier wrote:

So it seems that reverting 57431a91 is an option on the table?
Opinions?

Seems a bit early to go that way. Given that this bug was reported
yesterday, and that there's reasonable sketches how to fix this, I'm not
clear why reverting would already be on the table?

Greetings,

Andres Freund

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#8)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Michael Paquier <michael@paquier.xyz> writes:

Also, I am noticing another consequence as the handling of backend
variable files also suffers consequences:
could not open backend variables file
"pgsql_tmp/pgsql_tmp.backend_var.21912.1": No such file or directory

Um ... where/how are you seeing that?

regards, tom lane

#11Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#10)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

On Wed, May 15, 2019 at 11:32:38PM -0400, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

Also, I am noticing another consequence as the handling of backend
variable files also suffers consequences:
could not open backend variables file
"pgsql_tmp/pgsql_tmp.backend_var.21912.1": No such file or directory

Um ... where/how are you seeing that?

On HEAD with EXEC_BACKEND, just like that:
$ cd src/bin/pg_ctl && PROVE_TESTS=t/004_logrotate.pl make check
$ cat tmp_check/log/004_logrotate_primary.log
2019-05-16 12:44:03.427 JST [26433] LOG: redirecting log output to
logging collector process
2019-05-16 12:44:03.427 JST [26433] HINT: Future log output will
appear in directory "log".
could not open backend variables file
"pgsql_tmp/pgsql_tmp.backend_var.26433.1": No such file or directory

The test is able to pass, but we have a race condition between the
moment the backend file gets saved and the moment we allow it to be
read. I have not spent much time checking the stack between
InitializeMaxBackends() and RemovePgTempFiles() in postmaster.c, but
57431a9 triggers the failure.
--
Michael

#12Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#11)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

On Thu, May 16, 2019 at 12:56:17PM +0900, Michael Paquier wrote:

The test is able to pass, but we have a race condition between the
moment the backend file gets saved and the moment we allow it to be
read. I have not spent much time checking the stack between
InitializeMaxBackends() and RemovePgTempFiles() in postmaster.c, but
57431a9 triggers the failure.

Oh, I think I got it. The issue is that we call RemovePgTempFiles()
after starting the syslogger. This cannot be run with other processes
running in parallel, and with EXEC_BACKEND there is the extra case
where we have a pgsql_tmp/ at the root of the data folder, so the
syslogger complains on that. By making RemovePgTempFiles() happen
before starting the syslogger, then the test complains again about the
assertion without my previous patch applied of course. With the patch
applied, I get no complains.
--
Michael

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#12)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Michael Paquier <michael@paquier.xyz> writes:

On Thu, May 16, 2019 at 12:56:17PM +0900, Michael Paquier wrote:

The test is able to pass, but we have a race condition between the
moment the backend file gets saved and the moment we allow it to be
read. I have not spent much time checking the stack between
InitializeMaxBackends() and RemovePgTempFiles() in postmaster.c, but
57431a9 triggers the failure.

Oh, I think I got it. The issue is that we call RemovePgTempFiles()
after starting the syslogger. This cannot be run with other processes
running in parallel, and with EXEC_BACKEND there is the extra case
where we have a pgsql_tmp/ at the root of the data folder, so the
syslogger complains on that. By making RemovePgTempFiles() happen
before starting the syslogger, then the test complains again about the
assertion without my previous patch applied of course. With the patch
applied, I get no complains.

Hm, should we separate the cleanup of the root pgsql_tmp/ from the rest of
what RemovePgTempFiles does? I'm still feeling like we should be trying
to launch the syslogger as soon as possible.

regards, tom lane

#14Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#13)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

On Thu, May 16, 2019 at 12:51:34AM -0400, Tom Lane wrote:

Hm, should we separate the cleanup of the root pgsql_tmp/ from the rest of
what RemovePgTempFiles does? I'm still feeling like we should be trying
to launch the syslogger as soon as possible.

Indeed, this should be safe, the syslogger is not going to touch any
of the other temp files anyway. If we begin to do that, we had better
update the comments at the top of RemovePgTempFiles()'s call, and
expose RemovePgTempFilesInDir().

Anyway, I have to admit that I am not really enthusiastic about
relaxing the assertions for the win32 and the sysv interfaces, and do
that inconsistently on top of it.

As far as my first investigations have gone, assertions in
pgwin32_ReserveSharedMemoryRegion() and
PGSharedMemoryReAttach():sysv_shmem.c would need relaxing, but we rely
on that since a7e5878, which is quite some time ago. And, actually,
the failure for pgwin32_ReserveSharedMemoryRegion() should never
happen at all, because we need to call reset_shared() before starting
the syslogger as well, no? If we want to be able to log the creation
socket messages with the logging collector, it seems to me that we
would need first to remove the dependency with the port number in
PGSharedMemoryCreate() to find a free IPC key. Thoughts?
--
Michael

#15Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#14)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

On Thu, May 16, 2019 at 03:06:28PM +0900, Michael Paquier wrote:

As far as my first investigations have gone, assertions in
pgwin32_ReserveSharedMemoryRegion() and
PGSharedMemoryReAttach():sysv_shmem.c would need relaxing, but we rely
on that since a7e5878, which is quite some time ago. And, actually,
the failure for pgwin32_ReserveSharedMemoryRegion() should never
happen at all, because we need to call reset_shared() before starting
the syslogger as well, no?

Trying to avoid calling pgwin32_ReserveSharedMemoryRegion() for
processes which are never going to connect to shared memory is
actually an interesting option, because this way there is no need to
relax the assertions it includes, and there is no point in reserving a
shared memory area if it is never going to use it. This would not
cause a process trying to reattach to cause conflicts as well, which
is why we pre-allocate the region in the first place.

So, I have been hacking my way through this idea, and finished with
the attached. I have added the changes to src/tools/msvc/ and
postgresql.conf.sample (enabling logging_collector by default) to test
the viability of the change, and after fixing the first assertion
issues and the second one with the temporary files, I have bumped into
a third issue causing recovery/t/017_shm.pl to remain stuck on Linux
with -DEXEC_BACKEND as the syslogger keeps starting and stopping, as
it enters. This gets masked because of the temporary file issues
and/or the assertion problems.

This may sound rough, but based on the timing and the time I can spend
on that, I really think that we should revert 57431a9, as this has not
been completely thought through for EXEC_BACKEND, and the patch I am
attaching is a set of rough counter-measures which keep piling up,
giving a result I am not comfortable with at all. I am not sure that
I could actually go to the bottom of it without reworking the way we
attach, detach and reattach to shared memory a process, particularly
on Windows where we don't actually need to allocate a region for a
process we know will never use shared memory (right?). And this is no
time to do so for v12. Last year I did the same kind of mistake by
underestimating the Windows-related stuff for v11 just before a
release which has resulted in an urgent revert as of df8b5f3. I don't
want to do the same mistake again.
--
Michael

Attachments:

shmem-asserts-v3.patchtext/x-diff; charset=us-asciiDownload+31-18
#16Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#15)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Hi,

Peter, it'd be really good if you could chime in here PDQ.

On 2019-05-17 13:25:36 +0900, Michael Paquier wrote:

This may sound rough, but based on the timing and the time I can spend
on that, I really think that we should revert 57431a9, as this has not
been completely thought through for EXEC_BACKEND, and the patch I am
attaching is a set of rough counter-measures which keep piling up,
giving a result I am not comfortable with at all. I am not sure that
I could actually go to the bottom of it without reworking the way we
attach, detach and reattach to shared memory a process, particularly
on Windows where we don't actually need to allocate a region for a
process we know will never use shared memory (right?). And this is no
time to do so for v12.

I'm not sure I see this as sufficient reason to revert.

Last year I did the same kind of mistake by
underestimating the Windows-related stuff for v11 just before a
release which has resulted in an urgent revert as of df8b5f3. I don't
want to do the same mistake again.

I don't think a windows-only patch is the same thing as a feature, one
that makes debugging easier to boot, that's applicable to all platforms.

diff --git a/src/tools/msvc/config_default.pl b/src/tools/msvc/config_default.pl
index d7a9fc5039..de070e26e5 100644
--- a/src/tools/msvc/config_default.pl
+++ b/src/tools/msvc/config_default.pl
@@ -3,7 +3,7 @@ use strict;
use warnings;
our $config = {
-	asserts => 0,    # --enable-cassert
+	asserts => 1,    # --enable-cassert
# float4byval=>1,         # --disable-float4-byval, on by default
# float8byval=> $platformbits == 64, # --disable-float8-byval,
@@ -17,7 +17,7 @@ our $config = {
gss       => undef,    # --with-gssapi=<path>
icu       => undef,    # --with-icu=<path>
nls       => undef,    # --enable-nls=<path>
-	tap_tests => undef,    # --enable-tap-tests
+	tap_tests => 1,    # --enable-tap-tests
tcl       => undef,    # --with-tcl=<path>
perl      => undef,    # --with-perl
python    => undef,    # --with-python=<path>

Hm? Is this to coerce cfbot?

Greetings,

Andres Freund

#17Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#16)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

On Fri, May 17, 2019 at 01:24:43PM -0700, Andres Freund wrote:

I don't think a windows-only patch is the same thing as a feature, one
that makes debugging easier to boot, that's applicable to all
platforms.

We are discussing here a set of problems which breaks the logging
collector to work on Windows, requiring mostly a Windows patch. And I
am not much willing to mess up with the area of shared memory handling
on Windows a couple of days before a beta.

diff --git a/src/tools/msvc/config_default.pl b/src/tools/msvc/config_default.pl
@@ -17,7 +17,7 @@ our $config = {
gss       => undef,    # --with-gssapi=<path>
icu       => undef,    # --with-icu=<path>
nls       => undef,    # --enable-nls=<path>
-	tap_tests => undef,    # --enable-tap-tests
+	tap_tests => 1,    # --enable-tap-tests
tcl       => undef,    # --with-tcl=<path>
perl      => undef,    # --with-perl
python    => undef,    # --with-python=<path>

Hm? Is this to coerce cfbot?

This is part of the patch only because it has proved to be useful for
testing with MSVC on my VMs. I am not suggesting to add that in this
patch if this gets merged into HEAD. It is actually a good idea to
enforce that sometimes for the cfbot.. I'll remember your suggestion
for future patches.
--
Michael

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#17)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

Michael Paquier <michael@paquier.xyz> writes:

On Fri, May 17, 2019 at 01:24:43PM -0700, Andres Freund wrote:

I don't think a windows-only patch is the same thing as a feature, one
that makes debugging easier to boot, that's applicable to all
platforms.

We are discussing here a set of problems which breaks the logging
collector to work on Windows, requiring mostly a Windows patch. And I
am not much willing to mess up with the area of shared memory handling
on Windows a couple of days before a beta.

Yeah. I started digging around this finally, and I agree that the
Windows side of it is a mess --- in particular, on Windows
internal_forkexec() calls pgwin32_ReserveSharedMemoryRegion, which
means we have a hard-wired assumption there that no postmaster child
process is launched till after shared memory exists. EXEC_BACKEND
on Unix will not reproduce that.

At this point I agree with Michael that reverting 57431a911 is the most
prudent thing to do for beta1. While it's probably not *that* hard to
fix, it will require Windows-specific testing. Moreover, the fact that
there are race conditions involved means that one person's testing might
not catch everything. So I do not think we have enough time to get a
trustworthy fix in place by Monday. And we do have other things to
worry about.

(I'm also getting annoyed that Peter, as the original author of this
commit, isn't doing anything about the issue.)

I wouldn't be totally opposed to un-reverting after beta1, if a
better-tested patch emerges. But we don't have such a patch today
and I don't see how we'll get there this weekend.

regards, tom lane

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#15)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

[ some notes for whoever tries to fix this in the future ]

Michael Paquier <michael@paquier.xyz> writes:

Trying to avoid calling pgwin32_ReserveSharedMemoryRegion() for
processes which are never going to connect to shared memory is
actually an interesting option, because this way there is no need to
relax the assertions it includes,

Yeah, I'd come to the same conclusion while thinking about this,
but I don't think you're going about it in quite the right way.

In the first place, it's a mistake to try to drive the what-to-do-about-
shared-memory decision strictly off the type of the subprocess. In the
case of the syslogger, in particular, we have not only the initial launch
to think about but also a possible relaunch if the first syslogger child
crashes. In that scenario there *would* be shared memory to detach from.

So the sketch that's emerging in my mind is

1. The postmaster should have its own state variable, say
"static bool shared_memory_exists", which it sets at the appropriate
time. (We could rely on testing UsedShmemSegAddr for that, but I don't
think that'd be good design; UsedShmemSegAddr is really private state of
the particular shmem implementation.) Since internal_forkexec is
inside postmaster.c and runs in the postmaster process, it could just
test that state variable directly to see if it needs to call
pgwin32_ReserveSharedMemoryRegion.

2. I think we'd be well advised to get rid of the process-type-specific
tests in SubPostmasterMain, too. The idea in my mind is to have the
postmaster pass down an additional switch explicitly saying what to do
about shared memory, so that what SubPostmasterMain does is roughly

if (strcmp(argv[2], "--reattach") == 0)
PGSharedMemoryReAttach();
else if (strcmp(argv[2], "--noreattach") == 0)
PGSharedMemoryNoReAttach();
else
/* do nothing, shared memory does not exist yet */ ;

Then the problem for the syslogger is solved by passing --noreattach
or not depending on whether shared_memory_exists is true yet.

I've not worked that out in any detail; getting the switches included
properly might be too much of a pain for this to be an improvement.
But for sure I don't want there to be multiple copies of that list of
which subprocess types use shared memory.

A variant idea is to include shared_memory_exists in what's passed down
by the BackendParameters mechanism, and then subprocesses can adjust
their behavior for themselves.

In any case, the thrust of all of this is that we shouldn't touch any
of the assertions in the shmem support files; rather, the way to fix
it is to improve the logic in postmaster.c so that we don't call those
functions at all in child processes that are launched before shmem
exists.

regards, tom lane

#20Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#18)
Re: BUG #15804: Assertion failure when using logging_collector with EXEC_BACKEND

On Sat, May 18, 2019 at 12:18:28PM -0400, Tom Lane wrote:

At this point I agree with Michael that reverting 57431a911 is the most
prudent thing to do for beta1. While it's probably not *that* hard to
fix, it will require Windows-specific testing. Moreover, the fact that
there are race conditions involved means that one person's testing might
not catch everything. So I do not think we have enough time to get a
trustworthy fix in place by Monday. And we do have other things to
worry about.

Thanks for confirming. There is little time for doing the revert, and
it would have been nice seeing comments from Peter. Anyway, I should
be able to do that myself in 24 hours or so after coming back in
front of my desk so as the buildfarm can have a couple of cycles for
its work. If you prefer doing that yourself, that's of course fine by
me.

I wouldn't be totally opposed to un-reverting after beta1, if a
better-tested patch emerges. But we don't have such a patch today
and I don't see how we'll get there this weekend.

Possible. My recommendation is to keep this as potential v13 work and
focus on v12 stability. I have no doubt that after beta1 emerges we
will get enough reports to work on.
--
Michael

#21Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#18)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#20)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#22)
#24Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#24)
#26Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#22)
#27Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#25)
#28Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#19)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#27)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#29)
#31Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#30)
#32Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#21)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#32)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#30)
#35Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#34)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#35)
#37Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#37)
#39Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#38)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#39)