[bug fix] Produce a crash dump before main() on Windows
Hello,
postgres.exe on Windows doesn't output a crash dump when it crashes before main() is called. The attached patch fixes this. I'd like this to be back-patched. I'll add this to the next CF.
The original problem happened on our customer's production system. Their application sometimes failed to connect to the database. That was because postgres.exe crashed due to access violation (exception code C0000005). But there was no crash dump, so we had difficulty in finding the cause. The frequency was low -- about ten times during half a year.
What caused the access violation was Symantec's antivirus software. It seems that sysfer.dll of the software intercepts registry access, during C runtime library initialization, before main() is called. So, the direct cause of this problem is not PostgreSQL.
On the other hand, it's PostgreSQL's problem that we can't get the crash dump, which makes the investigation difficult. The cause is that postmaster calls SetErrorMode() to disable the outputing of crash dumps by WER (Windows Error Reporting). This error mode is inherited from postmaster to its children. If a crash happens before the child sets up the exception handler, no crash dump is produced.
Regards
Takayuki Tsunakawa
Attachments:
crash_dump_before_main.patchapplication/octet-stream; name=crash_dump_before_main.patchDownload+7-0
On Fri, Feb 16, 2018 at 8:28 AM, Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:
Hello,
postgres.exe on Windows doesn't output a crash dump when it crashes before
main() is called. The attached patch fixes this. I'd like this to be
back-patched. I'll add this to the next CF.The original problem happened on our customer's production system. Their
application sometimes failed to connect to the database. That was because
postgres.exe crashed due to access violation (exception code C0000005).
But there was no crash dump, so we had difficulty in finding the cause.
The frequency was low -- about ten times during half a year.What caused the access violation was Symantec's antivirus software. It
seems that sysfer.dll of the software intercepts registry access, during C
runtime library initialization, before main() is called. So, the direct
cause of this problem is not PostgreSQL.On the other hand, it's PostgreSQL's problem that we can't get the crash
dump, which makes the investigation difficult. The cause is that
postmaster calls SetErrorMode() to disable the outputing of crash dumps by
WER (Windows Error Reporting). This error mode is inherited from
postmaster to its children. If a crash happens before the child sets up
the exception handler, no crash dump is produced.
The original call to SetErrorMode() was put in there to make sure we didn't
show a popup message which would then make everything freeze (see very old
commit 27bff7502f04ee01237ed3f5a997748ae43d3a81). Doesn't this turn that
back on, so that if you are not actually there to monitor something you can
end up with stuck processes and exactly the issues we had before that one?
It might still be useful in debugging, but in that case we might need to
make it a configurable switch?
--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>
On 20 February 2018 at 21:47, Magnus Hagander <magnus@hagander.net> wrote:
On Fri, Feb 16, 2018 at 8:28 AM, Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:Hello,
postgres.exe on Windows doesn't output a crash dump when it crashes
before main() is called. The attached patch fixes this. I'd like this to
be back-patched. I'll add this to the next CF.The original problem happened on our customer's production system. Their
application sometimes failed to connect to the database. That was because
postgres.exe crashed due to access violation (exception code C0000005).
But there was no crash dump, so we had difficulty in finding the cause.
The frequency was low -- about ten times during half a year.What caused the access violation was Symantec's antivirus software. It
seems that sysfer.dll of the software intercepts registry access, during C
runtime library initialization, before main() is called. So, the direct
cause of this problem is not PostgreSQL.On the other hand, it's PostgreSQL's problem that we can't get the crash
dump, which makes the investigation difficult. The cause is that
postmaster calls SetErrorMode() to disable the outputing of crash dumps by
WER (Windows Error Reporting). This error mode is inherited from
postmaster to its children. If a crash happens before the child sets up
the exception handler, no crash dump is produced.The original call to SetErrorMode() was put in there to make sure we
didn't show a popup message which would then make everything freeze (see
very old commit 27bff7502f04ee01237ed3f5a997748ae43d3a81). Doesn't this
turn that back on, so that if you are not actually there to monitor
something you can end up with stuck processes and exactly the issues we had
before that one?
Ha, I just went digging for the same.
We should not disable WER when running as a service (no UI access), it will
not display an interactive dialog.
I'm not convinced we should disable it at all personally. Things have come
a long way from drwatson.exe . Disabling WER makes it hard to debug
postgres by installing Visual Studio Debugger as the hander (I always
wondered why that didn't work!) and is generally just painful. It prevents
us from collecting data via Microsoft about crashes, should we wish to do
so. And who runs Pg on windows except as a service?!
So I'm all for just removing that.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 20 February 2018 at 22:18, Craig Ringer <craig@2ndquadrant.com> wrote:
So I'm all for just removing that.
... but just to be clear, about -1000 on backpatching any such thing. At
most, a new GUC that defaults to the current behaviour. But I think it's
pretty niche really.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 20, 2018 at 3:18 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 20 February 2018 at 21:47, Magnus Hagander <magnus@hagander.net> wrote:
On Fri, Feb 16, 2018 at 8:28 AM, Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:Hello,
postgres.exe on Windows doesn't output a crash dump when it crashes
before main() is called. The attached patch fixes this. I'd like this to
be back-patched. I'll add this to the next CF.The original problem happened on our customer's production system.
Their application sometimes failed to connect to the database. That was
because postgres.exe crashed due to access violation (exception code
C0000005). But there was no crash dump, so we had difficulty in finding
the cause. The frequency was low -- about ten times during half a year.What caused the access violation was Symantec's antivirus software. It
seems that sysfer.dll of the software intercepts registry access, during C
runtime library initialization, before main() is called. So, the direct
cause of this problem is not PostgreSQL.On the other hand, it's PostgreSQL's problem that we can't get the crash
dump, which makes the investigation difficult. The cause is that
postmaster calls SetErrorMode() to disable the outputing of crash dumps by
WER (Windows Error Reporting). This error mode is inherited from
postmaster to its children. If a crash happens before the child sets up
the exception handler, no crash dump is produced.The original call to SetErrorMode() was put in there to make sure we
didn't show a popup message which would then make everything freeze (see
very old commit 27bff7502f04ee01237ed3f5a997748ae43d3a81). Doesn't this
turn that back on, so that if you are not actually there to monitor
something you can end up with stuck processes and exactly the issues we had
before that one?Ha, I just went digging for the same.
We should not disable WER when running as a service (no UI access), it
will not display an interactive dialog.I'm not convinced we should disable it at all personally. Things have come
a long way from drwatson.exe . Disabling WER makes it hard to debug
postgres by installing Visual Studio Debugger as the hander (I always
wondered why that didn't work!) and is generally just painful. It prevents
us from collecting data via Microsoft about crashes, should we wish to do
so. And who runs Pg on windows except as a service?!
I've seen a number of usecases where apps start it alongside the app
instead of as a service. I'm not sure how recent those apps are though, and
I'm not sure it's better than using a service in the first place (but it
does let you install things without being an admin).
We really shouldn't *break* that scenario for people. But making it work
well for the service usecase should definitely be the priority.
--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>
On 20 February 2018 at 23:43, Magnus Hagander <magnus@hagander.net> wrote:
I've seen a number of usecases where apps start it alongside the app
instead of as a service. I'm not sure how recent those apps are though, and
I'm not sure it's better than using a service in the first place (but it
does let you install things without being an admin).We really shouldn't *break* that scenario for people. But making it work
well for the service usecase should definitely be the priority.
<http://www.redpill-linpro.com/>
Good point and agreed.
The patch proposed here means that early crashes will invoke WER. If we're
going to allow WER we should probably just do so unconditionally.
I suggest changing this to a command line flag or environment variable test
that suppresses Pg's default disabling of WER. A GUC probably doesn't make
sense; it's too niche, and too early.
I'd be in favour of leaving WER on when we find out we're in a
noninteractive service too, but that'd be a separate patch for pg11+ only.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: Craig Ringer [mailto:craig@2ndquadrant.com]
The patch proposed here means that early crashes will invoke WER. If we're
going to allow WER we should probably just do so unconditionally.I'd be in favour of leaving WER on when we find out we're in a noninteractive
service too, but that'd be a separate patch for pg11+ only.
As for PG11+, I agree that we want to always leave WER on. That is, call SetErrorMode(SEM_FAILCRITICALERRORS) but not specify SEM_NOGPFAULTERRORBOX. The problem with the current specification of PostgreSQL is that the user can only get crash dumps in a fixed folder $PGDATA\crashdumps. That location is bad because the crash dumps will be backed up together with the database cluster without the user noticing it. What's worse, the crash dumps are large. With WER, the user can control the location and size of crash dumps.
I suggest changing this to a command line flag or environment variable test
that suppresses Pg's default disabling of WER. A GUC probably doesn't make
sense; it's too niche, and too early.
As for the past major releases, it's burdensome for the user to have to specify a new flag or an environment variable when he has to get crash dumps to investigate a rare crash before main(). It's not necessary to disable WER for crashes after main() is called, because PG installs an exception handler at the beginning of main(), which works fine. So can we go with the current patch?
Another idea to add to the current patch is to move the call to SetErrorMode() to the below function, which is called first in main(). How about this?
void
pgwin32_install_crashdump_handler(void)
{
SetUnhandledExceptionFilter(crashDumpHandler);
}
Regards
Takayuki Tsunakawa
From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
Another idea to add to the current patch is to move the call to SetErrorMode()
to the below function, which is called first in main(). How about this?void
pgwin32_install_crashdump_handler(void)
{
SetUnhandledExceptionFilter(crashDumpHandler);
}
I moved SetErrorMode() to the beginning of main(). It should be placed before any code which could crash. The current location is a bit late: in fact, write_stderr() crashed when WSAStartup() failed.
Regards
Takayuki Tsunakawa
Attachments:
crash_dump_before_main_v2.patchapplication/octet-stream; name=crash_dump_before_main_v2.patchDownload+10-3
On Thu, Mar 1, 2018 at 4:14 PM Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:
From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
Another idea to add to the current patch is to move the call to
SetErrorMode()
to the below function, which is called first in main(). How about this?
void
pgwin32_install_crashdump_handler(void)
{
SetUnhandledExceptionFilter(crashDumpHandler);
}I moved SetErrorMode() to the beginning of main(). It should be placed
before any code which could crash. The current location is a bit late: in
fact, write_stderr() crashed when WSAStartup() failed.
I reviewed patch and it works as per the subject, but I am not able to
verify the actual
bug that is reported in the upthread. The moving of setErrorMode() call to
the start
of the main function can handle all the cases that can lead to a crash with
no popup.
The reset of setErrorMode(0) before start of any process can generate the
coredump
because of any issue before it reaches the main() function, but this change
can lead
to stop the postmaster restart process, if no one present to observe the
scenario?
Doesn't this change can generate backward compatibility problems to
particular users?
I don't have any other comments with the current patch.
Regards,
Haribabu Kommi
Fujitsu Australia
On 26 February 2018 at 12:06, Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:
From: Craig Ringer [mailto:craig@2ndquadrant.com]
The patch proposed here means that early crashes will invoke WER. If
we're
going to allow WER we should probably just do so unconditionally.
I'd be in favour of leaving WER on when we find out we're in a
noninteractive
service too, but that'd be a separate patch for pg11+ only.
As for PG11+, I agree that we want to always leave WER on. That is, call
SetErrorMode(SEM_FAILCRITICALERRORS) but not specify
SEM_NOGPFAULTERRORBOX. The problem with the current specification of
PostgreSQL is that the user can only get crash dumps in a fixed folder
$PGDATA\crashdumps. That location is bad because the crash dumps will be
backed up together with the database cluster without the user noticing it.
What's worse, the crash dumps are large. With WER, the user can control
the location and size of crash dumps.
Yeah, that's quite old and dates back to when Windows didn't offer much if
any control over WER in services.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
At Wed, 18 Jul 2018 11:12:06 +0800, Craig Ringer <craig@2ndquadrant.com> wrote in <CAMsr+YHv0KfWhA+Z=UVydpvLQ-QyLaidBqpHxQ=YqTPiDGG6dg@mail.gmail.com>
On 26 February 2018 at 12:06, Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:From: Craig Ringer [mailto:craig@2ndquadrant.com]
The patch proposed here means that early crashes will invoke WER. If
we're
going to allow WER we should probably just do so unconditionally.
I'd be in favour of leaving WER on when we find out we're in a
noninteractive
service too, but that'd be a separate patch for pg11+ only.
As for PG11+, I agree that we want to always leave WER on. That is, call
SetErrorMode(SEM_FAILCRITICALERRORS) but not specify
SEM_NOGPFAULTERRORBOX. The problem with the current specification of
PostgreSQL is that the user can only get crash dumps in a fixed folder
$PGDATA\crashdumps. That location is bad because the crash dumps will be
backed up together with the database cluster without the user noticing it.
What's worse, the crash dumps are large. With WER, the user can control
the location and size of crash dumps.Yeah, that's quite old and dates back to when Windows didn't offer much if
any control over WER in services.
Yeah. If we want to take a crash dump, we cannot have
auto-restart. Since it is inevitable what we can do for this
would be adding a new knob for that, which cannot be turned on
together with restart_after_crash...?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From: Haribabu Kommi [mailto:kommi.haribabu@gmail.com]
I reviewed patch and it works as per the subject, but I am not able to verify
the actual
bug that is reported in the upthread. The moving of setErrorMode() call
to the start
of the main function can handle all the cases that can lead to a crash with
no popup.
Thank you for reviewing the patch. Yeah, I don't know reproduce the problem too, because some bug of Symantec AntiVirus randomly caused the crash before main()...
The reset of setErrorMode(0) before start of any process can generate the
coredump
because of any issue before it reaches the main() function, but this change
can lead
to stop the postmaster restart process, if no one present to observe the
scenario?
Doesn't this change can generate backward compatibility problems to
particular users?
Sorry, I couldn't get it. Could you elaborate on it? I couldn't understand how this change has an adverse effect on the postmaster restart process (restart_after_crash = on) and backward compatibility.
Regards
Takayuki Tsunakawa
On Wed, Jul 18, 2018 at 06:09:57AM +0000, Tsunakawa, Takayuki wrote:
From: Haribabu Kommi [mailto:kommi.haribabu@gmail.com]
I reviewed patch and it works as per the subject, but I am not able to verify
the actual
bug that is reported in the upthread. The moving of setErrorMode() call
to the start
of the main function can handle all the cases that can lead to a crash with
no popup.Thank you for reviewing the patch. Yeah, I don't know reproduce the
problem too, because some bug of Symantec AntiVirus randomly caused
the crash before main()...
A suggestion that I have here would be to patch manually the postmaster
with any kind of code which would make it to crash before main() is
called. Anything is fine as long as a core dump is produced, right?
--
Michael
On 18 July 2018 at 14:34, Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Jul 18, 2018 at 06:09:57AM +0000, Tsunakawa, Takayuki wrote:
From: Haribabu Kommi [mailto:kommi.haribabu@gmail.com]
I reviewed patch and it works as per the subject, but I am not able to
verify
the actual
bug that is reported in the upthread. The moving of setErrorMode() call
to the start
of the main function can handle all the cases that can lead to a crashwith
no popup.
Thank you for reviewing the patch. Yeah, I don't know reproduce the
problem too, because some bug of Symantec AntiVirus randomly caused
the crash before main()...A suggestion that I have here would be to patch manually the postmaster
with any kind of code which would make it to crash before main() is
called. Anything is fine as long as a core dump is produced, right?
When I was doing the original crashdump support I wrote a crashme
extension that deliberately dereferenced a null pointer.
You'd do something similar in the postmaster for the testing.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 18, 2018 at 4:10 PM Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:
From: Haribabu Kommi [mailto:kommi.haribabu@gmail.com]
I reviewed patch and it works as per the subject, but I am not able to
verify
the actual
bug that is reported in the upthread. The moving of setErrorMode() call
to the start
of the main function can handle all the cases that can lead to a crashwith
no popup.
Thank you for reviewing the patch. Yeah, I don't know reproduce the
problem too, because some bug of Symantec AntiVirus randomly caused the
crash before main()...The reset of setErrorMode(0) before start of any process can generate the
coredump
because of any issue before it reaches the main() function, but thischange
can lead
to stop the postmaster restart process, if no one present to observe the
scenario?
Doesn't this change can generate backward compatibility problems to
particular users?Sorry, I couldn't get it. Could you elaborate on it? I couldn't
understand how this change has an adverse effect on the postmaster restart
process (restart_after_crash = on) and backward compatibility.
Sorry about not providing more details. It is just my doubt, because
I am not able to reproduce the problem to test it after the fix. May be
I can give a try by modifying the source code to get the crash.
My point is, With this patch, in case if the postgres crashses
before reaching main(), does it generate a popup?
If the answer to the above question is yes, then if the popup is
not attended by the DBA, I feel the postmaster is not able to
restart all the processes because it is waiting for the crash process
to close. Am I Correct?
Regards,
Haribabu Kommi
Fujitsu Australia
From: Haribabu Kommi [mailto:kommi.haribabu@gmail.com]
May be I can give a try by modifying the source code to get the crash.
Thank you, that would be great if you could come up with a good way!
My point is, With this patch, in case if the postgres crashses
before reaching main(), does it generate a popup?If the answer to the above question is yes, then if the popup is
not attended by the DBA, I feel the postmaster is not able to
restart all the processes because it is waiting for the crash process
to close. Am I Correct?
IIRC, the pop up doesn't appear under Windows service. If you start the database server with pg_ctl start on the command prompt, the pop up will appear, which I think is not bad.
Regards
Takayuki Tsunakawa
On Wed, Jul 18, 2018 at 6:39 PM Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:
From: Haribabu Kommi [mailto:kommi.haribabu@gmail.com]
May be I can give a try by modifying the source code to get the crash.
Thank you, that would be great if you could come up with a good way!
+ if (argc > 1 && strncmp(argv[1], "--fork", 6) == 0)
+ {
+ char *ptr = NULL;
+
+ printf ("%s",*ptr);
+ }
+
I added the above code as first execution code in main() function and I am
able to
reproduce the WER popup.
My point is, With this patch, in case if the postgres crashses
before reaching main(), does it generate a popup?If the answer to the above question is yes, then if the popup is
not attended by the DBA, I feel the postmaster is not able to
restart all the processes because it is waiting for the crash process
to close. Am I Correct?IIRC, the pop up doesn't appear under Windows service. If you start the
database server with pg_ctl start on the command prompt, the pop up will
appear, which I think is not bad.
Yes, the popup doesn't appear when it is started as service and appear when
it is started from command prompt.
I also think that scenarios of starting the server from command prompt may
be less, so this patch can be useful
to find out the issue by the starting the server from command prompt.
I marked the patch as "ready for committer".
Regards,
Haribabu Kommi
Fujitsu Australia
On Wed, Jul 18, 2018 at 4:38 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
IIRC, the pop up doesn't appear under Windows service. If you start the database server with pg_ctl start on the command prompt, the pop up will appear, which I think is not bad.
Hmm. I think that might be bad. What makes you think it isn't?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Robert Haas [mailto:robertmhaas@gmail.com]
On Wed, Jul 18, 2018 at 4:38 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:IIRC, the pop up doesn't appear under Windows service. If you start the
database server with pg_ctl start on the command prompt, the pop up will
appear, which I think is not bad.Hmm. I think that might be bad. What makes you think it isn't?
First, as Hari-san said, starting the server with "pg_ctl start" on the command prompt does not feel like production use but like test environment, and that usage seems mostly interactive. Second, the current PostgreSQL is not exempt from the same problem: if postmaster or standalone backend crashes before main(), the pop up would appear.
Regards
Takayuki Tsunakawa
On Mon, Feb 26, 2018 at 04:06:48AM +0000, Tsunakawa, Takayuki wrote:
As for PG11+, I agree that we want to always leave WER on. That is,
call SetErrorMode(SEM_FAILCRITICALERRORS) but not specify
SEM_NOGPFAULTERRORBOX. The problem with the current specification of
PostgreSQL is that the user can only get crash dumps in a fixed folder
$PGDATA\crashdumps. That location is bad because the crash dumps will
be backed up together with the database cluster without the user
noticing it. What's worse, the crash dumps are large. With WER, the
user can control the location and size of crash dumps.
I have not looked by myself at the problems around WER and such so I
don't have a clear opinion, but including crash dumps within backups is
*bad*. We have a set of filtering rules in basebackup.c and pg_rewind
since v11, we could make use of it and remove all contents from
crashdumps/ from what gets backed up or rewound. excludeDirContents
creates empty folders when those are listed, so we'd want to not create
an empty folder in Windows backups, which makes a bit more more
difficult.
--
Michael