BUG #5305: Postgres service stops when closing Windows session

Started by Cristian Bittel, about 16 years ago · 75 messages · hackers, bugs
#1Cristian Bittel
cbittel@gmail.com
hackersbugs

The following bug has been logged online:

Bug reference: 5305
Logged by: Cristian
Email address: cbittel@gmail.com
PostgreSQL version: 8.3.9
Operating system: Windows 2003 Server Standard x64
Description: Postgres service stops when closing Windows session
Details:

We connect to the Windows server using the Terminal Services Client (mstsc)
and perform maintenance tasks with pgAdmin 3.

The PostgreSQL service crashes when the user closes the session on Windows,
and the following error is recorded in the pg_log files:

LOG: server process (PID 5200) exited with exit code 128

LOG: terminating any other active server processes

WARNING: terminating connection because of crash of another server process

DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.

HINT: In a moment you should be able to reconnect to the database and
repeat your command. ..

The server has the following specs:

Windows 2003 SP2 Standard 64-bit, 4GB, NOT joined to a domain.

PostgreSQL 8.3.9

pgAdmin 3

We connect without the /console parameter.

Any ideas?

#2Robert Haas
robertmhaas@gmail.com
In reply to: Cristian Bittel (#1)
hackersbugs
Re: BUG #5305: Postgres service stops when closing Windows session

On Mon, Feb 1, 2010 at 11:28 AM, Cristian <cbittel@gmail.com> wrote:

The following bug has been logged online:

Bug reference:      5305
Logged by:          Cristian
Email address:      cbittel@gmail.com
PostgreSQL version: 8.3.9
Operating system:   Windows 2003 Server Standard x64
Description:        Postgres service stops when closing Windows session
Details:

We connect to the Windows server using the Terminal Services Client (mstsc)
and perform maintenance tasks with pgAdmin 3.

The PostgreSQL service crashes when the user closes the session on Windows,
and the following error is recorded in the pg_log files:

LOG:  server process (PID 5200) exited with exit code 128

LOG:  terminating any other active server processes

WARNING:  terminating connection because of crash of another server process

DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.

HINT:  In a moment you should be able to reconnect to the database and
repeat your command. ..

The server has the following specs:

Windows 2003 SP2 Standard 64-bit, 4GB, NOT joined to a domain.

PostgreSQL 8.3.9

pgAdmin 3

We connect without the /console parameter.

Any ideas?

So you're saying that if pgadmin is open when you close the terminal
services session, the SERVER crashes?

Did you somehow start the server in that same session, or is the
server running as a service?

...Robert

#3Cristian Bittel
cbittel@gmail.com
In reply to: Robert Haas (#2)
hackersbugs
Re: BUG #5305: Postgres service stops when closing Windows session

2010/2/3 Robert Haas <robertmhaas@gmail.com>

So you're saying that if pgadmin is open when you close the terminal
services session, the SERVER crashes?

Did you somehow start the server in that same session, or is the
server running as a service?

...Robert

If pgAdmin is open inside any mstsc session, mine or another user's
terminal session, the main PostgreSQL service crashes.

The logs below report a server process exit code of 128; the two final
lines are repeated for each active connection to Postgres from the Apache
server. Further below (originally in Spanish) are the Event Viewer entries
showing the Administrator user logging off and then the "postgres" user
trying to log in to Windows again:

2009-10-13 22:10:47 PYT LOG: loaded library "$libdir/plugins/plugin_debugger.dll"
2009-10-13 22:30:08 PYT LOG: loaded library "$libdir/plugins/plugin_debugger.dll"
2009-10-13 22:40:30 PYT LOG: loaded library "$libdir/plugins/plugin_debugger.dll"
2009-10-13 22:50:09 PYT LOG: loaded library "$libdir/plugins/plugin_debugger.dll"
*2009-10-13 22:57:41 PYT LOG: server process (PID 50516) exited with exit code 128*
2009-10-13 22:57:41 PYT LOG: terminating any other active server processes
2009-10-13 22:57:41 PYT WARNING: terminating connection because of crash of another server process
2009-10-13 22:57:41 PYT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2009-10-13 22:57:41 PYT HINT: In a moment you should be able to reconnect to the database and repeat your command.

The extract for the events:
1) Aplication Popup: postgres.exe Application Error. Application could not
initialize.
2) Service Control Manager: PostgreSQL Database Server 8.3 stopped.
3) Security: Session Login for the "postgres" user account by the
MICROSOFT_AUTHENTICATION_PACKAGE_V1_0
4, 5) Security: Details of session login for postgres user account.

Event Type: Information
Event Source: Application Popup
Event Category: None
Event ID: 26
Date: 13/10/2009
Time: 22:57:40
User: N/A
Computer: SVCTAG-DL6W3J1
Description:
Application popup: postgres.exe - Application Error : The application
failed to initialize properly (0xc0000142). Click on OK to terminate the
application.

For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.

Event Type: Information
Event Source: Service Control Manager
Event Category: None
Event ID: 7036
Date: 13/10/2009
Time: 22:57:42
User: N/A
Computer: SVCTAG-DL6W3J1
Description:
The PostgreSQL Database Server 8.3 service entered the stopped state.

For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.

Event Type: Success Audit
Event Source: Security
Event Category: Account Logon
Event ID: 680
Date: 13/10/2009
Time: 23:00:11
User: SVCTAG-DL6W3J1\postgres
Computer: SVCTAG-DL6W3J1
Description:
Logon attempt by: MICROSOFT_AUTHENTICATION_PACKAGE_V1_0
Logon account: postgres
Source Workstation: SVCTAG-DL6W3J1
Error Code: 0x0

For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.

Event Type: Success Audit
Event Source: Security
Event Category: Logon/Logoff
Event ID: 552
Date: 13/10/2009
Time: 23:00:11
User: NT AUTHORITY\SYSTEM
Computer: SVCTAG-DL6W3J1
Description:
Logon attempt using explicit credentials:
Logged on user:
User Name: SVCTAG-DL6W3J1$
Domain: WORKGROUP
Logon ID: (0x0,0x3E7)
Logon GUID: -
User whose credentials were used:
Target User Name: postgres
Target Domain: SVCTAG-DL6W3J1
Target Logon GUID: -

Target Server Name: localhost
Target Server Info: localhost
Caller Process ID: 428
Source Network Address: -
Source Port: -

For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.

Event Type: Success Audit
Event Source: Security
Event Category: Logon/Logoff
Event ID: 528
Date: 13/10/2009
Time: 23:00:11
User: SVCTAG-DL6W3J1\postgres
Computer: SVCTAG-DL6W3J1
Description:
Successful Logon:
User Name: postgres
Domain: SVCTAG-DL6W3J1
Logon ID: (0x0,0x277734D8)
Logon Type: 5
Logon Process: Advapi
Authentication Package: Negotiate
Workstation Name: SVCTAG-DL6W3J1
Logon GUID: -
Caller User Name: SVCTAG-DL6W3J1$
Caller Domain: WORKGROUP
Caller Logon ID: (0x0,0x3E7)
Caller Process ID: 428
Transited Services: -
Source Network Address: -
Source Port: -

For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.

Event Type: Success Audit
Event Source: Security
Event Category: Logon/Logoff
Event ID: 576
Date: 13/10/2009
Time: 23:00:11
User: SVCTAG-DL6W3J1\postgres
Computer: SVCTAG-DL6W3J1
Description:
Special privileges assigned to new logon:
User Name:
Domain:
Logon ID: (0x0,0x277734D8)
Privileges: SeImpersonatePrivilege

For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.

#4Robert Haas
robertmhaas@gmail.com
In reply to: Cristian Bittel (#3)
hackersbugs
Re: BUG #5305: Postgres service stops when closing Windows session

On Thu, Feb 4, 2010 at 8:38 AM, Cristian Bittel <cbittel@gmail.com> wrote:

2010/2/3 Robert Haas <robertmhaas@gmail.com>

So you're saying that if pgadmin is open when you close the terminal
services session, the SERVER crashes?

Did you somehow start the server in that same session, or is the
server running as a service?

...Robert

If pgAdmin is open inside any mstsc session, mine or another user's
terminal session, the main PostgreSQL service crashes.

The logs below report a server process exit code of 128; the two final
lines are repeated for each active connection to Postgres from the Apache
server. Further below (originally in Spanish) are the Event Viewer entries
showing the Administrator user logging off and then the "postgres" user
trying to log in to Windows again:

That's really odd. Nothing pgAdmin does should be able to crash the
PostgreSQL server, I would think. Have you got any custom code loaded
into PostgreSQL? Or non-custom, but buggy?

I'm guessing the problem only occurs if PGadmin is actually connected
to the PostgreSQL server, but perhaps you could verify that. If so, I
would see if you can get a stack backtrace of the backend to which
PGadmin is connected.

...Robert

#5Chris Travers
chris@metatrontech.com
In reply to: Robert Haas (#4)
hackersbugs
Re: BUG #5305: Postgres service stops when closing Windows session

On Sat, Feb 6, 2010 at 2:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:

That's really odd.  Nothing pgAdmin does should be able to crash the
PostgreSQL server, I would think.  Have you got any custom code loaded
into PostgreSQL?  Or non-custom, but buggy?

I'm guessing the problem only occurs if PGadmin is actually connected
to the PostgreSQL server, but perhaps you could verify that.  If so, I
would see if you can get a stack backtrace of the backend to which
PGadmin is connected.

It wouldn't surprise me if this were a Windows bug (Terminal Services
may have improved since I was supporting it but it used to be quite
common that it would cause weird behavior in applications).... I
personally think the stack trace is likely to be the best way to test
where the problem is.

Best Wishes,
Chris Travers

#6Robert Haas
robertmhaas@gmail.com
In reply to: Chris Travers (#5)
hackersbugs
Re: BUG #5305: Postgres service stops when closing Windows session

On Sat, Feb 6, 2010 at 9:09 PM, Chris Travers <chris@metatrontech.com> wrote:

On Sat, Feb 6, 2010 at 2:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:

That's really odd.  Nothing pgAdmin does should be able to crash the
PostgreSQL server, I would think.  Have you got any custom code loaded
into PostgreSQL?  Or non-custom, but buggy?

I'm guessing the problem only occurs if PGadmin is actually connected
to the PostgreSQL server, but perhaps you could verify that.  If so, I
would see if you can get a stack backtrace of the backend to which
PGadmin is connected.

It wouldn't surprise me if this were a Windows bug (Terminal Services
may have improved since I was supporting it but it used to be quite
common that it would cause weird behavior in applications)....  I
personally think the stack trace is likely to be the best way to test
where the problem is.

I suspect this is the same problem as bug #4897, and probably also the
same problem as this:
http://archives.postgresql.org/pgsql-bugs/2009-08/msg00114.php

and maybe also this and this:
http://archives.postgresql.org/pgsql-bugs/2010-02/msg00179.php
http://archives.postgresql.org/pgsql-admin/2009-05/msg00105.php

Unfortunately, it seems that no one has been able to get a stack trace yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#7Soporte @ Teksol SA
soporte@teksol.com.ar
In reply to: Robert Haas (#6)
hackersbugs
Re: BUG #5305: Postgres service stops when closing Windows session

From my side, I have no way to get a stack trace from the production
servers where I found the issue. I have several other servers with
almost the same configuration for development purposes, and there are no
crashes there. I don't have any code in the database; there are no
compiled functions, just SQL queries from PHP code, using persistent
connections via pg_pconnect().

All these bugs seem to be the same issue. It took some time to relate the
crashes with exit code 128 to terminal sessions ending; sometimes
more than one session is open.

Is this a worldwide issue, or is it something that affects only
non-US-English versions of Windows 2003 Standard x64 Server?
Mine are in Spanish, another report is from a French system, and another
came from a British one. It also seems to be independent of the Postgres
version: I use 8.3.9 and there is another report with 8.4.1. There is a
new version of pgAdmin; maybe I should replace the original one provided
with Postgres.

I appreciate all your effort,

Cristian.

#8Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#6)
hackersbugs
Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session

[moving to -hackers]

On Thu, Aug 19, 2010 at 9:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I suspect this is the same problem as bug #4897, and probably also the
same problem as this:
http://archives.postgresql.org/pgsql-bugs/2009-08/msg00114.php

and maybe also this and this:
http://archives.postgresql.org/pgsql-bugs/2010-02/msg00179.php
http://archives.postgresql.org/pgsql-admin/2009-05/msg00105.php

Unfortunately, it seems that no one has been able to get a stack trace yet.

Bruce pointed out yet another report of this problem to me:

http://archives.postgresql.org/pgsql-general/2010-08/msg00550.php

After some discussion with Magnus, I think what is going on here is
that the postmaster kicks off a new child process, which terminates
before it actually starts running our code, either in OS-supplied code
or some sort of "filter" like anti-spam or anti-virus software. It's
presumably NOT dying in our code because - at least AFAICS - we don't
exit(128) anywhere. One way we could possibly improve the situation
is to not treat this as a child crash - that is, don't do a
crash-and-restart cycle; just treat that backend as having done
elog(FATAL). The trick is that you need a reliable way to distinguish
between a regular child crash and an "early" child crash. Magnus
suggested perhaps we could create a mutex that the child grabs before
mapping shared memory; the postmaster could check whether the mutex
had been taken. If so, we handle the crash normally; if not, we just
chalk it up to experience and continue on.

This isn't really a "fix" for the bug in the sense that the nicest
thing of all would be to prevent the child from exiting abnormally in
the first place. But it's far from clear that we can control that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#8)
hackersbugs
Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session

Robert Haas <robertmhaas@gmail.com> writes:

After some discussion with Magnus, I think what is going on here is
that the postmaster kicks off a new child process, which terminates
before it actually starts running our code, either in OS-supplied code
or some sort of "filter" like anti-spam or anti-virus software. It's
presumably NOT dying in our code because - at least AFAICS - we don't
exit(128) anywhere.

IIRC, in POSIX-compliant shells there's a specific convention about what
exit(128) means, and it's something that could result from exec()
failure. It might be too much of a stretch to suppose that Windows is
following that, but if it is, that would square with your idea that this
is happening during child process startup.
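
For reference, the shell convention being recalled here can be checked directly on a POSIX system. As far as the POSIX shell specification goes, status 127 means the command was not found, 126 means it was found but could not be executed, and a status greater than 128 means the command was killed by a signal; common shells report 128 plus the signal number. The sketch below assumes an `sh` on the PATH and uses a made-up command name. It says nothing about what Windows means by 128.

```python
import subprocess

def shell_status(command):
    """Run `command` under sh and return sh's own exit status."""
    return subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode

def status_seen_by_shell(command):
    """Run `command` in a child shell and return the $? its parent shell saw."""
    result = subprocess.run(
        ["sh", "-c", f"sh -c '{command}'; echo $?"],
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())

# Hypothetical command name that should not exist on any system.
not_found = shell_status("no_such_command_for_this_example")
# The child shell kills itself with SIGKILL (signal 9).
killed = status_seen_by_shell("kill -KILL $$")
print(not_found, killed)
```

Under this convention a bare 128 is not itself a defined exec-failure status, so the match with the Windows exit code may be coincidental.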

One way we could possibly improve the situation
is to not treat this as a child crash - that is, don't do a
crash-and-restart cycle; just treat that backend as having done
elog(FATAL).

That seems to me like a great idea for decreasing reliability, not
increasing it. If you mistakenly classify a child death as "not
a crash" then you're really seriously hosed; the best outcome you
can hope for is that the database freezes up without doing any
major damage to itself.

Furthermore, even if it is an early exit and you can afford to ignore
it, the client side is still going to see a dropped connection and tell
the user that the server crashed, and we're still going to get bug
reports about that.

I would be inclined to write this off as Windows randomness that's
unfixable on our end. We could recommend that people take a closer
look at what AV software they have installed and maybe try some other
one.

regards, tom lane

#10Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#9)
hackersbugs
Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session

On Mon, Aug 23, 2010 at 17:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

After some discussion with Magnus, I think what is going on here is
that the postmaster kicks off a new child process, which terminates
before it actually starts running our code, either in OS-supplied code
or some sort of "filter" like anti-spam or anti-virus software.  It's
presumably NOT dying in our code because - at least AFAICS - we don't
exit(128) anywhere.

IIRC, in POSIX-compliant shells there's a specific convention about what
exit(128) means, and it's something that could result from exec()
failure.  It might be too much of a stretch to suppose that Windows is
following that, but if it is, that would square with your idea that this
is happening during child process startup.

It is (assuming the idea is correct).

The problem is that the error code is not delivered at CreateProcess()
time - it's delivered later.
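
The gap described here, where process creation succeeds and the failure only shows up afterwards as an exit status, can be illustrated with a small portable sketch. This is not postmaster code: it uses Python's `subprocess` module as a stand-in for CreateProcess(), and the child that exits with status 128 is a hypothetical stand-in for a backend that dies during startup.

```python
import subprocess
import sys

def spawn_child_that_dies_early(exit_code=128):
    """Start a child that exits immediately with the given status.

    Hypothetical stand-in for a backend that dies before running any
    of its own code; in the reported bug the failure is inside Windows.
    """
    return subprocess.Popen(
        [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]
    )

child = spawn_child_that_dies_early()
# Creation itself succeeded: the parent holds a valid process and PID.
created_ok = child.pid > 0
# The failure is only visible later, when the parent reaps the child.
status = child.wait()
print(created_ok, status)
```

From the parent's side, the only evidence of the failed startup is the exit status collected after the fact, which is why the postmaster cannot reject the child at creation time.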

One way we could possibly improve the situation
is to not treat this as a child crash - that is, don't do a
crash-and-restart cycle; just treat that backend as having done
elog(FATAL).

That seems to me like a great idea for decreasing reliability, not
increasing it.  If you mistakenly classify a child death as "not
a crash" then you're really seriously hosed; the best outcome you
can hope for is that the database freezes up without doing any
major damage to itself.

Furthermore, even if it is an early exit and you can afford to ignore
it, the client side is still going to see a dropped connection and tell
the user that the server crashed, and we're still going to get bug
reports about that.

Yes, but it's Less Evil.

I would be inclined to write this off as Windows randomness that's
unfixable on our end.  We could recommend that people take a closer
look at what AV software they have installed and maybe try some other
one.

It may well be, but we can at least attempt to mitigate it, no?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#10)
hackersbugs
Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session

Magnus Hagander <magnus@hagander.net> writes:

On Mon, Aug 23, 2010 at 17:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I would be inclined to write this off as Windows randomness that's
unfixable on our end.  We could recommend that people take a closer
look at what AV software they have installed and maybe try some other
one.

It may well be, but we can at least attempt to mitigate it, no?

I'm not excited about a "mitigation" approach that introduces new
data-loss hazards of its very own. That doesn't meet the Less Evil
standard in my eyes.

[ thinks for a bit... ] Although maybe it'd be all right to piggyback
on the dead-man-switch code that already exists in pmsignal.c. If the
child process hasn't got as far as doing MarkPostmasterChildActive,
then in principle it should be okay to assume it hasn't touched shared
memory. This really is independent of what exit code it returned.

regards, tom lane

#12Cristian Bittel
cbittel@gmail.com
In reply to: Tom Lane (#11)
hackersbugs
Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session

From the user's point of view, this could be a Windows or AV issue, but it
only stops the Postgres service; it does not affect or interfere with
Windows stability or AV stability. Instead it affects your product. So if
you can improve the stability of the service (and data integrity above all),
it would benefit everyone.

I've seen the same behavior of the Postgres service when closing an MSTSC
session without any AV installed. After some months of Postgres crashes,
the administrators installed Kaspersky for Servers AV, and the crashes are
still there.

Cristian.

#13Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#11)
hackersbugs
Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session

On Mon, Aug 23, 2010 at 11:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Magnus Hagander <magnus@hagander.net> writes:

On Mon, Aug 23, 2010 at 17:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I would be inclined to write this off as Windows randomness that's
unfixable on our end.  We could recommend that people take a closer
look at what AV software they have installed and maybe try some other
one.

It may well be, but we can at least attempt to mitigate it, no?

I'm not excited about a "mitigation" approach that introduces new
data-loss hazards of its very own.  That doesn't meet the Less Evil
standard in my eyes.

[ thinks for a bit... ]  Although maybe it'd be all right to piggyback
on the dead-man-switch code that already exists in pmsignal.c.  If the
child process hasn't got as far as doing MarkPostmasterChildActive,
then in principle it should be okay to assume it hasn't touched shared
memory.  This really is independent of what exit code it returned.

I'm confused. That seems like it would be LESS safe than the proposed
approach of taking a mutex just before mapping shared memory. There
is some finite amount of code that executes after shared memory is
mapped and before MarkPostmasterChildActive executes; the advantage of
the mutex is that it can be taken BEFORE shared memory is mapped. On
the other hand, if you think it's safe enough, it would certainly be
nice to use an existing mechanism rather than inventing something
totally new.
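
The difference between the two detection points can be made concrete with a toy model. The startup steps and function names below are illustrative assumptions, not PostgreSQL code. The only things taken from the thread are the ordering (the proposed mutex would be taken before shared memory is mapped, while MarkPostmasterChildActive runs some time after) and the rule that a child counts as an early exit only if it never reached the detection point.

```python
# Hypothetical, simplified backend startup sequence (order as described in the thread).
STARTUP_STEPS = [
    "os_loader",            # OS code runs before any PostgreSQL code
    "take_mutex",           # proposed marker, taken BEFORE mapping shared memory
    "map_shared_memory",
    "other_initialization",
    "mark_child_active",    # stands in for MarkPostmasterChildActive
    "run_queries",
]

def classify(died_during, detection_step):
    """Return 'early exit' if the child died before completing detection_step.

    'early exit' means the postmaster would skip the crash-and-restart cycle;
    'crash' means it would treat shared memory as possibly corrupted.
    """
    died_at = STARTUP_STEPS.index(died_during)
    detected_at = STARTUP_STEPS.index(detection_step)
    return "early exit" if died_at <= detected_at else "crash"

def unsafe_window(detection_step):
    """Steps where shared memory mapping has begun but the child still looks 'early'."""
    mapped_at = STARTUP_STEPS.index("map_shared_memory")
    return [
        step
        for step in STARTUP_STEPS
        if STARTUP_STEPS.index(step) >= mapped_at
        and classify(step, detection_step) == "early exit"
    ]

print(unsafe_window("take_mutex"))
print(unsafe_window("mark_child_active"))
```

In this model the mutex leaves no window, while the existing marker leaves the steps between mapping and marking classified as early exits. Whether that window matters in practice depends on whether the code in it can actually damage shared memory, which is the judgment call being discussed.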

I agree that the exit code is irrelevant.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#14Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#8)
hackersbugs
Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session

Robert Haas wrote:

[moving to -hackers]

On Thu, Aug 19, 2010 at 9:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I suspect this is the same problem as bug #4897, and probably also the
same problem as this:
http://archives.postgresql.org/pgsql-bugs/2009-08/msg00114.php

and maybe also this and this:
http://archives.postgresql.org/pgsql-bugs/2010-02/msg00179.php
http://archives.postgresql.org/pgsql-admin/2009-05/msg00105.php

Unfortunately, it seems that no one has been able to get a stack trace yet.

Bruce pointed out yet another report of this problem to me:

http://archives.postgresql.org/pgsql-general/2010-08/msg00550.php

After some discussion with Magnus, I think what is going on here is
that the postmaster kicks off a new child process, which terminates
before it actually starts running our code, either in OS-supplied code
or some sort of "filter" like anti-spam or anti-virus software. It's
presumably NOT dying in our code because - at least AFAICS - we don't
exit(128) anywhere. One way we could possibly improve the situation
is to not treat this as a child crash - that is, don't do a
crash-and-restart cycle; just treat that backend as having done
elog(FATAL). The trick is that you need a reliable way to distinguish
between a regular child crash and an "early" child crash. Magnus
suggested perhaps we could create a mutex that the child grabs before
mapping shared memory; the postmaster could check whether the mutex
had been taken. If so, we handle the crash normally; if not, we just
chalk it up to experience and continue on.

This isn't really a "fix" for the bug in the sense that the nicest
thing of all would be to prevent the child from exiting abnormally in
the first place. But it's far from clear that we can control that.

This URL has some interesting details on our problem:

http://stackoverflow.com/questions/139090/getexitcodeprocess-returns-128

Error code 128 is identified as:

error code 128 ERROR_WAIT_NO_CHILDREN 128 0x80 There are no child
processes to wait for

and the suggested cause is:

Have a look at Desktop Heap memory.

Essentially the desktop heap issue comes down to exhausted resources (eg
starting too many processes). When your app runs out of these resources,
one of the symptoms is that you won't be able to start a new process,
and the call to CreateProcess will fail with code 128.

My guess is that at the time of CreateProcess(), there is enough desktop
heap memory, but at some later time, perhaps caused by a logout, there
isn't and the process never gets started.
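
To make the code values in this thread easier to cross-check, here is a small lookup sketch. The table is deliberately tiny and hand-written: 128 (0x80) is the Win32 error ERROR_WAIT_NO_CHILDREN quoted above, and 0xC0000142, which appears in the Application Popup event earlier in the thread, is the NTSTATUS value STATUS_DLL_INIT_FAILED. The function name `describe_exit_code` is made up for this example, and the sketch does not query Windows.

```python
# Hand-written table; only the two codes that appear in this thread.
KNOWN_CODES = {
    0x00000080: "ERROR_WAIT_NO_CHILDREN (Win32 error 128)",
    0xC0000142: "STATUS_DLL_INIT_FAILED (NTSTATUS)",
}

def describe_exit_code(code):
    """Format an exit code as decimal and hex, with a name if it is in the table."""
    unsigned = code & 0xFFFFFFFF  # exit codes are 32-bit; normalize negatives
    name = KNOWN_CODES.get(unsigned, "unknown")
    return f"{unsigned} (0x{unsigned:08X}): {name}"

print(describe_exit_code(128))
print(describe_exit_code(0xC0000142))
```

Whether the backend's exit status of 128 really is this Win32 error value, rather than an unrelated number that happens to match, is an assumption carried over from the linked discussion.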

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#15Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#14)
hackersbugs
Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session

On Tue, Aug 24, 2010 at 8:57 AM, Bruce Momjian <bruce@momjian.us> wrote:

Robert Haas wrote:

[moving to -hackers]

On Thu, Aug 19, 2010 at 9:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I suspect this is the same problem as bug #4897, and probably also the
same problem as this:
http://archives.postgresql.org/pgsql-bugs/2009-08/msg00114.php

and maybe also this and this:
http://archives.postgresql.org/pgsql-bugs/2010-02/msg00179.php
http://archives.postgresql.org/pgsql-admin/2009-05/msg00105.php

Unfortunately, it seems that no one has been able to get a stack trace yet.

Bruce pointed out yet another report of this problem to me:

http://archives.postgresql.org/pgsql-general/2010-08/msg00550.php

After some discussion with Magnus, I think what is going on here is
that the postmaster kicks off a new child process, which terminates
before it actually starts running our code, either in OS-supplied code
or some sort of "filter" like anti-spam or anti-virus software.  It's
presumably NOT dying in our code because - at least AFAICS - we don't
exit(128) anywhere.  One way we could possibly improve the situation
is to not treat this as a child crash - that is, don't do a
crash-and-restart cycle; just treat that backend as having done
elog(FATAL).  The trick is that you need a reliable way to distinguish
between a regular child crash and an "early" child crash.  Magnus
suggested perhaps we could create a mutex that the child grabs before
mapping shared memory; the postmaster could check whether the mutex
had been taken.  If so, we handle the crash normally; if not, we just
chalk it up to experience and continue on.

This isn't really a "fix" for the bug in the sense that the nicest
thing of all would be to prevent the child from exiting abnormally in
the first place.  But it's far from clear that we can control that.

This URL has some interesting details on our problem:

       http://stackoverflow.com/questions/139090/getexitcodeprocess-returns-128

Error code 128 is identified as:

       error code 128 ERROR_WAIT_NO_CHILDREN 128 0x80 There are no child
       processes to wait for

and the suggested cause is:

       Have a look at Desktop Heap memory.

       Essentially the desktop heap issue comes down to exhausted resources (eg
       starting too many processes). When your app runs out of these resources,
       one of the symptoms is that you won't be able to start a new process,
       and the call to CreateProcess will fail with code 128.

My guess is that at the time of CreateProcess(), there is enough desktop
heap memory, but at some later time, perhaps caused by a logout, there
isn't and the process never gets started.

Yeah, that seems very plausible, although exactly how to verify I don't know.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#16Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#15)
hackersbugs
Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session

Robert Haas wrote:

This isn't really a "fix" for the bug in the sense that the nicest
thing of all would be to prevent the child from exiting abnormally in
the first place.  But it's far from clear that we can control that.

This URL has some interesting details on our problem:

       http://stackoverflow.com/questions/139090/getexitcodeprocess-returns-128

Error code 128 is identified as:

       error code 128 ERROR_WAIT_NO_CHILDREN 128 0x80 There are no child
       processes to wait for

and the suggested cause is:

       Have a look at Desktop Heap memory.

       Essentially the desktop heap issue comes down to exhausted resources (eg
       starting too many processes). When your app runs out of these resources,
       one of the symptoms is that you won't be able to start a new process,
       and the call to CreateProcess will fail with code 128.

My guess is that at the time of CreateProcess(), there is enough desktop
heap memory, but at some later time, perhaps caused by a logout, there
isn't and the process never gets started.

Yeah, that seems very plausible, although exactly how to verify I don't know.

And here is confirmation from the Microsoft web site:

http://support.microsoft.com/kb/156484

Cmd.exe, Perl.exe, or other console-mode applications may fail to
initialize properly and terminate prematurely when launched by a service
using the CreateProcess() or CreateProcessAsUser() APIs. The calling
process has no way of knowing that the launched console-mode application
has terminated prematurely.

In some instances, calling GetExitCode() against the failed process
indicates the following exit code:
128L ERROR_WAIT_NO_CHILDREN - There are no child processes to wait for.
...
Internet Information Server (IIS) may exhibit this problem
intermittently when processing CGI or Perl scripts. In this case the
browser returns the following error when executing CGI scripts:

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#16)
hackersbugs
Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session

Bruce Momjian <bruce@momjian.us> writes:

Robert Haas wrote:

Yeah, that seems very plausible, although exactly how to verify I don't know.

And here is confirmation from the Microsoft web site:

In some instances, calling GetExitCode() against the failed process
indicates the following exit code:
128L ERROR_WAIT_NO_CHILDREN - There are no child processes to wait for.

Given the existence of the deadman switch mechanism (which I hadn't
remembered when this thread started), I'm coming around to the idea that
we could just treat exit(128) as nonfatal on Windows. If for some
reason the child hadn't died instantly at startup, the deadman switch
would distinguish that from the case described here.
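
A rough sketch of that rule, in portable C so it can be compiled anywhere. The function and parameter names are hypothetical, the deadman switch is reduced to a single boolean supplied by the caller, and only exit code 0 is treated as a clean exit here; none of this is taken from the actual postmaster code.

```c
#include <stdbool.h>

/* Numeric value reported in the thread for ERROR_WAIT_NO_CHILDREN. */
#define WAIT_NO_CHILDREN_EXIT_CODE 128

/*
 * Returns true if the exit should trigger a crash-and-restart cycle.
 *
 * deadman_switch_tripped is an assumption: it models "the child had
 * started up far enough that its death cannot be ignored".
 */
bool
child_exit_needs_restart(int exit_code, bool on_windows,
                         bool deadman_switch_tripped)
{
    if (deadman_switch_tripped)
        return true;    /* child got past startup, so treat as a real crash */
    if (exit_code == 0)
        return false;   /* clean exit */
    if (on_windows && exit_code == WAIT_NO_CHILDREN_EXIT_CODE)
        return false;   /* child never really started, so treat as nonfatal */
    return true;        /* any other nonzero exit is still a crash */
}
```

The order of the checks matters in this sketch: the deadman switch is consulted first, so an exit code of 128 from a child that did get past startup is still handled as a crash.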

regards, tom lane

#18Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#17)
hackersbugs
Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Robert Haas wrote:

Yeah, that seems very plausible, although exactly how to verify I don't know.

And here is confirmation from the Microsoft web site:

In some instances, calling GetExitCode() against the failed process
indicates the following exit code:
128L ERROR_WAIT_NO_CHILDREN - There are no child processes to wait for.

Given the existence of the deadman switch mechanism (which I hadn't
remembered when this thread started), I'm coming around to the idea that
we could just treat exit(128) as nonfatal on Windows. If for some
reason the child hadn't died instantly at startup, the deadman switch
would distinguish that from the case described here.

Agreed. My guess is that there is some kind of Win32 OS race condition
in allocating desktop heap memory, and that sometimes with concurrent
CreateProcess() calls, a process gets started but can't complete its
creation.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#19Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#17)
hackersbugs
Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Robert Haas wrote:

Yeah, that seems very plausible, although exactly how to verify I don't know.

And here is confirmation from the Microsoft web site:

In some instances, calling GetExitCode() against the failed process
indicates the following exit code:
128L ERROR_WAIT_NO_CHILDREN - There are no child processes to wait for.

Given the existence of the deadman switch mechanism (which I hadn't
remembered when this thread started), I'm coming around to the idea that
we could just treat exit(128) as nonfatal on Windows. If for some
reason the child hadn't died instantly at startup, the deadman switch
would distinguish that from the case described here.

Here is a more detailed explanation of the failure and its relation to
desktop heap:

http://kbalertz.com/Feedback.aspx?kbNumber=184802

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#20Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#17)
hackersbugs
Re: [BUGS] BUG #5305: Postgres service stops when closing Windows session

On Tue, Aug 24, 2010 at 15:58, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bruce Momjian <bruce@momjian.us> writes:

Robert Haas wrote:

Yeah, that seems very plausible, although exactly how to verify I don't know.

And here is confirmation from the Microsoft web site:

      In some instances, calling GetExitCode() against the failed process
      indicates the following exit code:
      128L ERROR_WAIT_NO_CHILDREN - There are no child processes to wait for.

Given the existence of the deadman switch mechanism (which I hadn't
remembered when this thread started), I'm coming around to the idea that
we could just treat exit(128) as nonfatal on Windows.  If for some
reason the child hadn't died instantly at startup, the deadman switch
would distinguish that from the case described here.

Just because I had written it before you posted that, here's how the
win32-specific-set-a-flag-when-we're-in-control thing would look. But
if we're convinced that just ignoring error 128 is safe, then that's
obviously a simpler patch.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Attachments:

win32_early_death.patch (application/octet-stream, +72 -12)