[bug fix] postgres.exe crashes with access violation on Windows while starting up

Started by Tsunakawa, Takayukiabout 8 years ago9 messages
#1Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
1 attachment(s)

Hello,

We encountered a rare and hard-to-investigate problem on Windows, which one of our customers reported. Please find the attached patch to fix that. I'll add this to the next CF.

PROBLEM
==============================

PostgreSQL sometimes crashes with the following messages. This is infrequent (but frequent for the customer); it occurred about 10 times in the past 5 months.

LOG: server process (PID 2712) was terminated by exception 0xC0000005
HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: all server processes terminated; reinitializing

"server process" shows that an client backend crashed. The above messages indicate that the process was not running an SQL command.

PostgreSQL runs as a Windows service.

No crash dump was produced anywhere, despite the facts:
- <PGDATA>/crashdumps folder exists and is writable by the PostgreSQL user account (which is the user postgres.exe runs as)
- The Windows registry configuration allows dumping the crash dump

CAUSE
==============================

We believe WSAStartup() in main.c failed. The only conceivable error is:

WSAEPROCLIM
10067
Too many processes.
A Windows Sockets implementation may have a limit on the number of applications that can use it simultaneously. WSAStartup may fail with this error if the limit has been reached.

But I couldn't find what the limit is and whether we can tune it. We couldn't reproduce the problem.

When I pretend that WSAStartup() failed while a client backend is starting up, I could see the same phenomenon as the customer. This problem only occurs when PostgreSQL runs as a Windows service.

The bug is in write_eventlog(). It calls pgwin32_message_to_utf16() which in turn calls palloc(), which requires the memory management system to be set up (CurrentMemoryContext != NULL).

FIX
==============================

Add the check "CurrentMemoryContext != NULL" in write_eventlog() as in write_console().

NOTE
==============================

The reason is for not outputing the crash dump is a) the crash occurred before installing the Windows exception handler (pgwin32_install_crashdump_handler() call) and b) the effect of the following call in postmaster is inherited in the child process.

/* In case of general protection fault, don't show GUI popup box */
SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);

But I'm not sure in what order we should do pgwin32_install_crashdump_handler(), startup_hacks() and steps therein, MemoryContextInit(). I think that's another patch.

Regards
Takayuki Tsunakawa

Attachments:

write_eventlog_crash.patchapplication/octet-stream; name=write_eventlog_crash.patchDownload
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 977c038..e47c0ae 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -2117,10 +2117,15 @@ write_eventlog(int level, const char *line, int len)
 	 * try to convert the message to UTF16 and write it with ReportEventW().
 	 * Fall back on ReportEventA() if conversion failed.
 	 *
+	 * Since we palloc the structure required for conversion, also fall
+	 * through to writing unconverted if we have not yet set up
+	 * CurrentMemoryContext.
+	 *
 	 * Also verify that we are not on our way into error recursion trouble due
 	 * to error messages thrown deep inside pgwin32_message_to_UTF16().
 	 */
 	if (!in_error_recursion_trouble() &&
+		CurrentMemoryContext != NULL &&
 		GetMessageEncoding() != GetACPEncoding())
 	{
 		utf16 = pgwin32_message_to_UTF16(line, len, NULL);
#2Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Tsunakawa, Takayuki (#1)
1 attachment(s)
Re: [bug fix] postgres.exe crashes with access violation on Windows while starting up

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tsunakawa,
Takayuki
The reason is for not outputing the crash dump is a) the crash occurred
before installing the Windows exception handler
(pgwin32_install_crashdump_handler() call) and b) the effect of the
following call in postmaster is inherited in the child process.

/* In case of general protection fault, don't show GUI popup
box */
SetErrorMode(SEM_FAILCRITICALERRORS |
SEM_NOGPFAULTERRORBOX);

But I'm not sure in what order we should do
pgwin32_install_crashdump_handler(), startup_hacks() and steps therein,
MemoryContextInit(). I think that's another patch.

Just installing the handler at the beginning of main() seems fine. Patch attached.

Regards
Takayuki Tsunakawa

Attachments:

crash_dump_before_installing_handler.patchapplication/octet-stream; name=crash_dump_before_installing_handler.patchDownload
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 87b7d3b..f9d673f 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -61,6 +61,14 @@ main(int argc, char *argv[])
 {
 	bool		do_check_root = true;
 
+	/*
+	 * If supported on the current platform, set up a handler to be called if
+	 * the backend/postmaster crashes with a fatal signal or exception.
+	 */
+#if defined(WIN32) && defined(HAVE_MINIDUMP_TYPE)
+	pgwin32_install_crashdump_handler();
+#endif
+
 	progname = get_progname(argv[0]);
 
 	/*
@@ -82,14 +90,6 @@ main(int argc, char *argv[])
 	argv = save_ps_display_args(argc, argv);
 
 	/*
-	 * If supported on the current platform, set up a handler to be called if
-	 * the backend/postmaster crashes with a fatal signal or exception.
-	 */
-#if defined(WIN32) && defined(HAVE_MINIDUMP_TYPE)
-	pgwin32_install_crashdump_handler();
-#endif
-
-	/*
 	 * Fire up essential subsystems: error and memory management
 	 *
 	 * Code after this point is allowed to use elog/ereport, though
#3Michael Paquier
michael.paquier@gmail.com
In reply to: Tsunakawa, Takayuki (#1)
Re: [bug fix] postgres.exe crashes with access violation on Windows while starting up

On Thu, Oct 26, 2017 at 7:10 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

FIX
==============================

Add the check "CurrentMemoryContext != NULL" in write_eventlog() as in write_console().

* Also verify that we are not on our way into error recursion
trouble due
* to error messages thrown deep inside pgwin32_message_to_UTF16().
*/
if (!in_error_recursion_trouble() &&
+ CurrentMemoryContext != NULL &&
GetMessageEncoding() != GetACPEncoding())
{
So you are basically ready to lose any message that could be pushed
here if there is no memory context? That does not sound like a good
trade-off to me. A static buffer does not look like the best idea
either to not truncate message, so couldn't we envisage to just use
malloc? pgwin32_message_to_UTF16() is called in two places in elog.c,
and there is a full control on the error code paths.

NOTE
==============================

The reason is for not outputing the crash dump is a) the crash occurred before installing the Windows exception handler (pgwin32_install_crashdump_handler() call) and b) the effect of the following call in postmaster is inherited in the child process.

/* In case of general protection fault, don't show GUI popup box */
SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);

But I'm not sure in what order we should do pgwin32_install_crashdump_handler(), startup_hacks() and steps therein, MemoryContextInit(). I think that's another patch.

Perhaps. I don't have a final opinion on this matter.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Michael Paquier (#3)
Re: [bug fix] postgres.exe crashes with access violation on Windows while starting up

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
So you are basically ready to lose any message that could be pushed
here if there is no memory context? That does not sound like a good
trade-off to me. A static buffer does not look like the best idea
either to not truncate message, so couldn't we envisage to just use
malloc? pgwin32_message_to_UTF16() is called in two places in elog.c,
and there is a full control on the error code paths.

Thank you for reviewing a rare bug fix on Windows that most people wouldn't be interested in. When CurrentMemoryContext is NULL, the message is logged with ReportEventA(). This is similar to write_console().

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Michael Paquier
michael.paquier@gmail.com
In reply to: Tsunakawa, Takayuki (#4)
Re: [bug fix] postgres.exe crashes with access violation on Windows while starting up

On Tue, Oct 31, 2017 at 6:59 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
So you are basically ready to lose any message that could be pushed
here if there is no memory context? That does not sound like a good
trade-off to me. A static buffer does not look like the best idea
either to not truncate message, so couldn't we envisage to just use
malloc? pgwin32_message_to_UTF16() is called in two places in elog.c,
and there is a full control on the error code paths.

Thank you for reviewing a rare bug fix on Windows that most people wouldn't be interested in.

Yeah, it may take a while until a committer gets interested I am
afraid. See my bug about pg_basebackup on Windows with path names..

When CurrentMemoryContext is NULL, the message is logged with ReportEventA(). This is similar to write_console().

My point is that as Postgres is running as a service, isn't it wrong
to write a message to stderr as a fallback if the memory context is
not set? You would lose a message. It seems to me that for an
operation that can happen at a low-level like the postmaster startup,
we should really use a low-level operation as well.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Michael Paquier (#5)
Re: [bug fix] postgres.exe crashes with access violation on Windows while starting up

From: Michael Paquier [mailto:michael.paquier@gmail.com]

On Tue, Oct 31, 2017 at 6:59 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

When CurrentMemoryContext is NULL, the message is logged with

ReportEventA(). This is similar to write_console().

My point is that as Postgres is running as a service, isn't it wrong to
write a message to stderr as a fallback if the memory context is not set?
You would lose a message. It seems to me that for an operation that can
happen at a low-level like the postmaster startup, we should really use
a low-level operation as well.

I'm sorry I may not have been clear. With this patch, write_eventlog() outputs the message to event log with ReportEventA() when CurrentMemoryContext is NULL. It doesn't write to stderr. So the message won't be lost.

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Michael Paquier
michael.paquier@gmail.com
In reply to: Tsunakawa, Takayuki (#6)
Re: [bug fix] postgres.exe crashes with access violation on Windows while starting up

On Wed, Nov 1, 2017 at 12:07 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: Michael Paquier [mailto:michael.paquier@gmail.com]

On Tue, Oct 31, 2017 at 6:59 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

When CurrentMemoryContext is NULL, the message is logged with

ReportEventA(). This is similar to write_console().

My point is that as Postgres is running as a service, isn't it wrong to
write a message to stderr as a fallback if the memory context is not set?
You would lose a message. It seems to me that for an operation that can
happen at a low-level like the postmaster startup, we should really use
a low-level operation as well.

I'm sorry I may not have been clear. With this patch, write_eventlog() outputs the message to event log with ReportEventA() when CurrentMemoryContext is NULL. It doesn't write to stderr. So the message won't be lost.

Oh, yes. Sorry. I got confused a bit by write_eventlog(), which is
already doing what your patch is adding for write_eventlog(). I am
switching the patch as ready for committer, I definitely agree that
you are taking the write approach here.

I am also adding Noah to get some input on this issue, as he is the
author and committer of 5f538ad0 which has improved the handling of
non-ASCII characters in this code path, and more importantly has
tweaked 43adc7a7 to handle properly transaction contexts in
pgwin32_message_to_UTF16() which is where the palloc calls happen. I
would be the one in the pool of committers who would most likely
commit your patch.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Noah Misch
noah@leadboat.com
In reply to: Michael Paquier (#7)
Re: [bug fix] postgres.exe crashes with access violation on Windows while starting up

On Sat, Oct 28, 2017 at 03:43:02PM -0700, Michael Paquier wrote:

couldn't we envisage to just use
malloc? pgwin32_message_to_UTF16() is called in two places in elog.c,
and there is a full control on the error code paths.

Switching to malloc is feasible, but it wouldn't enable PostgreSQL to handle
non-ASCII messages any earlier. Messages should be ASCII-only until the
init_locale(LC_CTYPE) call initializes MessageEncoding. (Before that call,
pgwin32_message_to_UTF16() assumes the message is UTF8-encoded. I've expanded
the comments slightly. We easily comply with that restriction today.)

On Fri, Nov 03, 2017 at 11:10:14AM +0000, Michael Paquier wrote:

I am
switching the patch as ready for committer, I definitely agree that
you are taking the write approach here.

Committed both patches.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Michael Paquier
michael.paquier@gmail.com
In reply to: Noah Misch (#8)
Re: [bug fix] postgres.exe crashes with access violation on Windows while starting up

On Mon, Nov 13, 2017 at 7:37 AM, Noah Misch <noah@leadboat.com> wrote:

On Fri, Nov 03, 2017 at 11:10:14AM +0000, Michael Paquier wrote:

I am
switching the patch as ready for committer, I definitely agree that
you are taking the write approach here.

s/write/right/.

Committed both patches.

Thanks for double-checking, Noah.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers