(Never?) Kill Postmaster?

Started by Stefan Schwarzer in October 2007 · 39 messages · general
#1 Stefan Schwarzer
stefan.schwarzer@grid.unep.ch

Hi there,

I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...

Now, what am I supposed to do if I launched a query which takes ages,
and which I want to interrupt?

Thanks for any advice,

Stef

____________________________________________________________________

Stefan Schwarzer

Lean Back and Relax - Enjoy some Nature Photography:
http://photoblog.la-famille-schwarzer.de

Appetite for Global Data? UNEP GEO Data Portal:
http://geodata.grid.unep.ch
____________________________________________________________________

#2
In reply to: Stefan Schwarzer (#1)
Re: (Never?) Kill Postmaster?

Postgres starts a separate process for each connection, and you can just
`kill` (not `kill -9`) the unwanted process. Or you can run `select
pg_cancel_backend(pid)` from another session.
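For the archives, a minimal sketch of that second route, assuming the 8.2-era pg_stat_activity column names (procpid was renamed to pid in later releases) and a made-up example pid:

```sql
-- Find the backend running the unwanted query (8.2-era column names).
SELECT procpid, usename, current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>';

-- Cancel just that query; the session itself stays connected.
SELECT pg_cancel_backend(12345);  -- hypothetical pid taken from the step above
```

pg_cancel_backend() delivers a SIGINT to that backend, so it is the in-database equivalent of a `kill -INT` on the same pid.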

#3 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Stefan Schwarzer (#1)
Re: (Never?) Kill Postmaster?

2007/10/24, Stefan Schwarzer <stefan.schwarzer@grid.unep.ch>:

Hi there,

I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...

Now, what am I supposed to do if I launched a query which takes ages, and
which I want to interrupt?

Thanks for any advice,

Stef

Hello,

you have to use a gentler way:

select pg_cancel_backend(pid)
http://www.postgresql.org/docs/current/interactive/functions-admin.html

Pavel


#4 Ow Mun Heng
Ow.Mun.Heng@wdc.com
In reply to: Stefan Schwarzer (#1)
Re: (Never?) Kill Postmaster?

On Wed, 2007-10-24 at 07:57 +0200, Stefan Schwarzer wrote:

Hi there,

I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...

Now, what am I supposed to do if I launched a query which takes ages,
and which I want to interrupt?

look for the query's procpid and then issue a select
pg_cancel_backend('the_id')

#5 Tommy Gildseth
tommy.gildseth@usit.uio.no
In reply to: Pavel Stehule (#3)
Re: (Never?) Kill Postmaster?

Pavel Stehule wrote:

2007/10/24, Stefan Schwarzer <stefan.schwarzer@grid.unep.ch>:

I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...

Now, what am I supposed to do if I launched a query which takes ages, and
which I want to interrupt?

you have to use a gentler way:

select pg_cancel_backend(pid)
http://www.postgresql.org/docs/current/interactive/functions-admin.html

The problem is that pg_cancel_backend() is somewhat unreliable at
cancelling wayward queries. You can try other signals with kill,
though; -9 is a last resort.

--
Tommy Gildseth

#6 Christian Schröder
cs@deriva.de
In reply to: Ow Mun Heng (#4)
Re: (Never?) Kill Postmaster?

Ow Mun Heng wrote:

look for the query's procpid and then issue a select
pg_cancel_backend('the_id')

Does it do any harm if I kill (either with signal 9 or signal 15) the
single backend process (and not the postmaster)?

Regards,
Christian

--
Deriva GmbH Tel.: +49 551 489500-42
Financial IT and Consulting Fax: +49 551 489500-91
Hans-Böckler-Straße 2 http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer

#7 Michael Harris
michael.harris@ericsson.com
In reply to: Stefan Schwarzer (#1)
Re: (Never?) Kill Postmaster?

Hi,

First you need to identify the correct postgresql process. Postgresql
spawns an individual server process for each database connection. They
look something like this:

postgres 27296 7089 9 08:00 ? 00:05:52 postgres: username
databasename [local] idle

If a query was running then it would say 'SELECT' instead of 'idle'.

You can send a SIGINT (ie, -2) to that process to cancel a query, eg

kill -2 27296

Note, though, that on most systems the default signal for kill is
SIGTERM (15), not SIGINT, so a plain kill <pid> terminates the whole
backend session rather than just cancelling the query.
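A quick, Postgres-free way to check which signal a bare kill sends by default (a throwaway sleep process stands in for the backend; on typical systems the answer is SIGTERM, 15):

```shell
# Start a stand-in process and kill it with no explicit signal.
sleep 60 &
pid=$!
kill "$pid"        # no signal name given: the default is SIGTERM (15)
wait "$pid"
status=$?
# Exit status 128 + 15 = 143 means the process died from SIGTERM.
echo "exit status: $status"
```

An exit status of 130 (128 + 2) instead would have indicated SIGINT.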

The tip is "Don't 'kill -9' the postmaster", which has two important
differences from the scenario I just described:

1) kill -9 means the OS kills the process without allowing it to clean
up after itself

2) The postmaster is the master postgresql backend process. If you want
to kill a single query you would not want to kill that.

Regards // Mike

________________________________

From: Stefan Schwarzer [mailto:stefan.schwarzer@grid.unep.ch]
Sent: Wednesday, 24 October 2007 3:58 PM
To: pgsql-general@postgresql.org
Subject: (Never?) Kill Postmaster?

Hi there,

I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...

Now, what am I supposed to do if I launched a query which takes ages,
and which I want to interrupt?

Thanks for any advice,

Stef

____________________________________________________________________

Stefan Schwarzer

Lean Back and Relax - Enjoy some Nature Photography:

http://photoblog.la-famille-schwarzer.de

Appetite for Global Data? UNEP GEO Data Portal:

http://geodata.grid.unep.ch

____________________________________________________________________

#8 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Christian Schröder (#6)
Re: (Never?) Kill Postmaster?

2007/10/24, Christian Schröder <cs@deriva.de>:

Ow Mun Heng wrote:

look for the query's procpid and then issue a select
pg_cancel_backend('the_id')

Does it do any harm if I kill (either with signal 9 or signal 15) the
single backend process (and not the postmaster)?

shared memory can be corrupted, and some garbage may be left behind
somewhere. It depends on the query.

Pavel


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Harris (#7)
Re: (Never?) Kill Postmaster?

"Michael Harris" <michael.harris@ericsson.com> writes:

The tip is "Don't 'kill -9' the postmaster", which has two important
differences from the scenario I just described:
1) kill -9 means the OS kills the process without allowing it to clean
up after itself
2) The postmaster is the master postgresql backend process. If you want
to kill a single query you would not want to kill that.

Right: the tip is to not kill -9 the parent process; it's not saying
anything about child server processes.

If you've got a child process that's unresponsive to SIGINT then you
can send it a SIGKILL instead; the downside is that this will force a
restart of the other children too, that is you're interrupting all
database sessions not only the one. But Postgres will recover
automatically and I don't think I've ever heard of anyone getting data
corruption as a result of such a thing.

SIGKILL on the parent is an entirely different thing. You'll have to
manually restart Postgres, possibly do some manual cleanup, and there's
a small but nonzero chance of ensuing data corruption ... especially if
you fat-finger any of the manual steps. Plus there simply isn't any
good reason to do it. The postmaster should always respond to more
gentle shutdown signals, because it doesn't run any user-provided
commands that could send it off into the weeds.

Hence the TIP.

regards, tom lane

#10 Bill Moran
wmoran@potentialtech.com
In reply to: Stefan Schwarzer (#1)
Re: (Never?) Kill Postmaster?

In response to Stefan Schwarzer <stefan.schwarzer@grid.unep.ch>:

Hi there,

I read dozens of times the "TIP 2: Don't 'kill -9' the postmaster"...

Now, what am I supposed to do if I launched a query which takes ages,
and which I want to interrupt?

Thanks for any advice,

Tracing through the other threads, this is obviously not the regular
wayward query, but one that won't die by the normal methods.

Unfortunately, I came across this recently, and the only solution I
found was to do a "pg_ctl restart -m i" (Yes, I tried -m f first).

Luckily, the db in question was such that the front ends didn't
suffer horribly from this and reconnected, and the database finished
its recovery in a timely manner.

Hopefully, I can generate a reproducible example so I can file a
bug, but haven't gotten that far with it yet.

--
Bill Moran
http://www.potentialtech.com

#11 Christian Schröder
cs@deriva.de
In reply to: Tom Lane (#9)
Re: (Never?) Kill Postmaster?

Tom Lane wrote:

"Michael Harris" <michael.harris@ericsson.com> writes:

The tip is "Don't 'kill -9' the postmaster", which has two important
differences from the scenario I just described:
1) kill -9 means the OS kills the process without allowing it to clean
up after itself
2) The postmaster is the master postgresql backend process. If you want
to kill a single query you would not want to kill that.

Right: the tip is to not kill -9 the parent process; it's not saying
anything about child server processes.

If you've got a child process that's unresponsive to SIGINT then you
can send it a SIGKILL instead; the downside is that this will force a
restart of the other children too, that is you're interrupting all
database sessions not only the one. But Postgres will recover
automatically and I don't think I've ever heard of anyone getting data
corruption as a result of such a thing.

I have been in exactly this situation today: one statement had been
running for several hours, so it needed to be cancelled. I tried a
"pg_cancel_backend" and a "kill -2" (which means SIGINT on our Linux
box), but nothing happened. Since I remembered this thread, I tried a
"kill -9" on this child process. As you described, all other connections
were reset, too, and this was the message in the server log:

<2007-10-31 22:48:28 CET - chschroe> WARNING: terminating connection
because of crash of another server process
<2007-10-31 22:48:28 CET - chschroe> DETAIL: The postmaster has
commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.

But then, when I tried to reconnect to the database, I received the
following message:

<2007-10-31 22:50:01 CET - chschroe> FATAL: the database system is in
recovery mode

Ok, you wrote "Postgres will recover automatically", but could this take
several minutes? Is that what "recovery mode" means? When nothing seemed
to happen for several minutes, I performed a (fortunately clean) restart
of the whole server. The log messages for the server restart looked
normal to me:

<2007-10-31 22:53:15 CET - > LOG: received smart shutdown request
<2007-10-31 22:53:21 CET - > LOG: all server processes terminated;
reinitializing
<2007-10-31 22:53:58 CET - > LOG: database system was interrupted at
2007-10-31 22:46:46 CET
<2007-10-31 22:53:58 CET - > LOG: checkpoint record is at 153/FE9FAF20
<2007-10-31 22:53:58 CET - > LOG: redo record is at 153/FE9FAF20; undo
record is at 0/0; shutdown FALSE
<2007-10-31 22:53:58 CET - > LOG: next transaction ID: 0/128715865;
next OID: 58311787
<2007-10-31 22:53:58 CET - > LOG: next MultiXactId: 4704; next
MultiXactOffset: 9414
<2007-10-31 22:53:58 CET - > LOG: database system was not properly shut
down; automatic recovery in progress
<2007-10-31 22:53:58 CET - > LOG: redo starts at 153/FE9FAF70
<2007-10-31 22:53:58 CET - > LOG: record with zero length at 153/FEA05E70
<2007-10-31 22:53:58 CET - > LOG: redo done at 153/FEA05E40
<2007-10-31 22:53:58 CET - > LOG: database system is ready

I hope that no data got corrupted. Is there any way to check this?

What is the conclusion of this experience? Is it contrary to the above
statements dangerous to kill (-9) a subprocess?

Regards,
Christian

--
Deriva GmbH Tel.: +49 551 489500-42
Financial IT and Consulting Fax: +49 551 489500-91
Hans-Böckler-Straße 2 http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christian Schröder (#11)
Re: (Never?) Kill Postmaster?

Christian Schröder <cs@deriva.de> writes:

But then, when I tried to reconnect to the database, I received the
following message:

<2007-10-31 22:50:01 CET - chschroe> FATAL: the database system is in
recovery mode

Ok, you wrote "Postgres will recover automatically", but could this take
several minutes?

Yeah, potentially. I don't suppose you have any idea how long it'd been
since your last checkpoint, but what do you have checkpoint_timeout and
checkpoint_segments set to?
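For reference, those two settings live in postgresql.conf; together they bound how much WAL automatic recovery may have to replay after a crash. A sketch with the values reported later in this thread:

```
# postgresql.conf -- checkpoint knobs (values as reported later in this thread)
checkpoint_timeout = 5min    # maximum time between automatic checkpoints
checkpoint_segments = 8      # maximum WAL segments between automatic checkpoints
```

The longer the interval between checkpoints, the longer the "database system is in recovery mode" phase can last after a crash.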

What is the conclusion of this experience?

That you have an itchy trigger finger ;-). It looks from the log that
you needed to wait only a few seconds longer for the recovery to have
completed.

What I'd like to know about is why the child process was unresponsive to
SIGINT in the first place. There's little we can do about long-running
plpython functions, for instance, but if it was looping in Postgres
proper then we should do something about that. Can you reproduce this
problem easily?

regards, tom lane

#13 Christian Schröder
cs@deriva.de
In reply to: Tom Lane (#12)
Re: (Never?) Kill Postmaster?

Tom Lane wrote:

Ok, you wrote "Postgres will recover automatically", but could this take
several minutes?

Yeah, potentially. I don't suppose you have any idea how long it'd been
since your last checkpoint, but what do you have checkpoint_timeout and
checkpoint_segments set to?

I did not change these parameters from their default values, so
checkpoint_timeout is 5 min and checkpoint_segments is 8.

What I'd like to know about is why the child process was unresponsive to
SIGINT in the first place. There's little we can do about long-running
plpython functions, for instance, but if it was looping in Postgres
proper then we should do something about that. Can you reproduce this
problem easily?

Unfortunately not. I have tried the same query and it took only about 1
sec to complete. In fact, it's a simple seq scan with a single filter
condition. No user defined functions are involved.
Maybe it has something to do with the users connecting from their
Windows machines to the PostgreSQL server using psqlodbc. On the other
hand, this is not the first time such a user connection has had to be
terminated, and we never experienced this problem before.
If I see the phenomenon again I will use strace or something similar to
find out what the backend process is doing.

Regards,
Christian

--
Deriva GmbH Tel.: +49 551 489500-42
Financial IT and Consulting Fax: +49 551 489500-91
Hans-Böckler-Straße 2 http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer

#14 Scott Marlowe
scott.marlowe@gmail.com
In reply to: Christian Schröder (#13)
Re: (Never?) Kill Postmaster?

On 10/31/07, Christian Schröder <cs@deriva.de> wrote:

Tom Lane wrote:

Ok, you wrote "Postgres will recover automatically", but could this take
several minutes?

Yeah, potentially. I don't suppose you have any idea how long it'd been
since your last checkpoint, but what do you have checkpoint_timeout and
checkpoint_segments set to?

I did not change these parameters from their default values, so
checkpoint_timeout is 5 min and checkpoint_segments is 8.

What I'd like to know about is why the child process was unresponsive to
SIGINT in the first place. There's little we can do about long-running
plpython functions, for instance, but if it was looping in Postgres
proper then we should do something about that. Can you reproduce this
problem easily?

Unfortunately not. I have tried the same query and it took only about 1
sec to complete. In fact, it's a simple seq scan with a single filter
condition. No user defined functions are involved.
Maybe it has something to do with the users connecting from their
Windows machines to the PostgreSQL server using psqlodbc. On the other
hand, this is not the first time such a user connection has had to be
terminated, and we never experienced this problem before.
If I see the phenomenon again I will use strace or something similar to
find out what the backend process is doing.

Tom, is it possible the backend was doing something that couldn't be
immediately interrupted, like a long wait on IO or something?

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Scott Marlowe (#14)
Re: (Never?) Kill Postmaster?

"Scott Marlowe" <scott.marlowe@gmail.com> writes:

Tom, is it possible the backend was doing something that couldn't be
immediately interrupted, like a long wait on IO or something?

Sherlock Holmes said that theorizing in advance of the data is a capital
mistake...

What we can be reasonably certain of is that that backend wasn't
reaching any CHECK_FOR_INTERRUPTS() macros. Whether it was hung up
waiting for something, or caught in a tight loop somewhere, is
impossible to say without more data than we have. AFAIR the OP didn't
even mention whether the backend appeared to be consuming CPU cycles
(which'd be a pretty fair tip about which of those to believe, but still
not enough to guess *where* the problem is). A gdb backtrace would tell
us more.

regards, tom lane

#16 Christian Schröder
cs@deriva.de
In reply to: Tom Lane (#15)
Re: (Never?) Kill Postmaster?

Tom Lane wrote:

What we can be reasonably certain of is that that backend wasn't
reaching any CHECK_FOR_INTERRUPTS() macros. Whether it was hung up
waiting for something, or caught in a tight loop somewhere, is
impossible to say without more data than we have. AFAIR the OP didn't
even mention whether the backend appeared to be consuming CPU cycles
(which'd be a pretty fair tip about which of those to believe, but still
not enough to guess *where* the problem is). A gdb backtrace would tell
us more.

It happened again! I'm not sure if I should be happy because we can now
maybe find the cause of the problem, or worried because it's our
production database ... At least the process doesn't seem to consume
CPU (it doesn't show up in "top"), so I won't kill it this time, but
instead try to get all the information that you guys need.
What I already did was an strace with the following result:

db2:/home/pgsql/data # strace -p 7129
Process 7129 attached - interrupt to quit
futex(0x994000, FUTEX_WAIT, 2, NULL)    = -1 EINTR (Interrupted system call)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
futex(0x994000, FUTEX_WAIT, 2, NULL

That interrupt will have been the script that tries to cancel
long-running queries. The same lines seem to repeat over and over again.

Then I attached a gdb to the process and printed out a backtrace:

db2:/home/pgsql/data # gdb --pid=7129
GNU gdb 6.5
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Attaching to process 7129
Reading symbols from /usr/local/pgsql_8.2.5/bin/postgres...done.
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/local/pgsql_8.2.5/lib/plpgsql.so...done.
Loaded symbols for /usr/local/pgsql_8.2.5/lib/plpgsql.so
Reading symbols from /usr/local/pgsql_8.2.5/lib/plperl.so...done.
Loaded symbols for /usr/local/pgsql_8.2.5/lib/plperl.so
Reading symbols from
/usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so...done.
Loaded symbols for
/usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 47248855881456 (LWP 7129)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from
/usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so...done.
Loaded symbols for
/usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so
0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00002af904809a68 in __lll_mutex_lock_wait () from
/lib64/libpthread.so.0
#1 0x00002af904806e88 in pthread_rwlock_rdlock () from
/lib64/libpthread.so.0
#2 0x00002af8fb13de23 in _nl_find_msg () from /lib64/libc.so.6
#3 0x00002af8fb13ec83 in __dcigettext () from /lib64/libc.so.6
#4 0x00002af8fb186f0b in strerror_r () from /lib64/libc.so.6
#5 0x00002af8fb186d33 in strerror () from /lib64/libc.so.6
#6 0x00000000005f4daa in expand_fmt_string ()
#7 0x00000000005f6d14 in errmsg ()
#8 0x00000000005182cc in internal_flush ()
#9 0x00000000005183b6 in internal_putbytes ()
#10 0x000000000051841c in pq_putmessage ()
#11 0x00000000005199c4 in pq_endmessage ()
#12 0x0000000000440c6a in printtup ()
#13 0x00000000004fc1b8 in ExecutorRun ()
#14 0x0000000000580451 in PortalRunSelect ()
#15 0x0000000000581446 in PortalRun ()
#16 0x000000000057d625 in exec_simple_query ()
#17 0x000000000057ea72 in PostgresMain ()
#18 0x0000000000558218 in ServerLoop ()
#19 0x0000000000558db8 in PostmasterMain ()
#20 0x000000000051a213 in main ()

Do you need anything else? Can you still tell what's happening?

Regards,
Christian

--
Deriva GmbH Tel.: +49 551 489500-42
Financial IT and Consulting Fax: +49 551 489500-91
Hans-Böckler-Straße 2 http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer

#17 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christian Schröder (#16)
Re: (Never?) Kill Postmaster?

Christian Schröder <cs@deriva.de> writes:

db2:/home/pgsql/data # gdb --pid=7129
GNU gdb 6.5
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Attaching to process 7129
Reading symbols from /usr/local/pgsql_8.2.5/bin/postgres...done.
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/local/pgsql_8.2.5/lib/plpgsql.so...done.
Loaded symbols for /usr/local/pgsql_8.2.5/lib/plpgsql.so
Reading symbols from /usr/local/pgsql_8.2.5/lib/plperl.so...done.
Loaded symbols for /usr/local/pgsql_8.2.5/lib/plperl.so
Reading symbols from
/usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so...done.
Loaded symbols for
/usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 47248855881456 (LWP 7129)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from
/usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so...done.
Loaded symbols for
/usr/lib/perl5/5.8.8/x86_64-linux-thread-multi/auto/Opcode/Opcode.so
0x00002af904809a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00002af904809a68 in __lll_mutex_lock_wait () from
/lib64/libpthread.so.0
#1 0x00002af904806e88 in pthread_rwlock_rdlock () from
/lib64/libpthread.so.0
#2 0x00002af8fb13de23 in _nl_find_msg () from /lib64/libc.so.6
#3 0x00002af8fb13ec83 in __dcigettext () from /lib64/libc.so.6
#4 0x00002af8fb186f0b in strerror_r () from /lib64/libc.so.6
#5 0x00002af8fb186d33 in strerror () from /lib64/libc.so.6
#6 0x00000000005f4daa in expand_fmt_string ()
#7 0x00000000005f6d14 in errmsg ()
#8 0x00000000005182cc in internal_flush ()
#9 0x00000000005183b6 in internal_putbytes ()
#10 0x000000000051841c in pq_putmessage ()
#11 0x00000000005199c4 in pq_endmessage ()
#12 0x0000000000440c6a in printtup ()
#13 0x00000000004fc1b8 in ExecutorRun ()
#14 0x0000000000580451 in PortalRunSelect ()
#15 0x0000000000581446 in PortalRun ()
#16 0x000000000057d625 in exec_simple_query ()
#17 0x000000000057ea72 in PostgresMain ()
#18 0x0000000000558218 in ServerLoop ()
#19 0x0000000000558db8 in PostmasterMain ()
#20 0x000000000051a213 in main ()

Too bad you don't have debug symbols :-(. However, this does raise a
couple of interesting questions:

* The only place internal_flush would call errmsg is here:

ereport(COMMERROR,
(errcode_for_socket_access(),
errmsg("could not send data to client: %m")));

So why is it unable to send data to the client?

* How the heck would strerror() be blocking on a thread mutex, when
the backend is single-threaded?

We recently discovered that it was possible for pltcl to cause the
backend to become multithreaded:
http://archives.postgresql.org/pgsql-patches/2007-09/msg00194.php
I see from your dump that this session has been using plperl, so I
wonder whether plperl has the same sort of issue. Can you determine
exactly what's been done with plperl in this session?

Also, can you confirm that there is actually more than one thread active
in this process? On Linux "ps -Lw" would show threads.

regards, tom lane
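A rough sketch of that thread check with ps on Linux (using the current shell's own PID as a stand-in for the backend's PID; nlwp is the per-process thread count, the same column Christian reads below):

```shell
# Print the number of kernel threads (LWPs) for one process.
pid=$$                                    # stand-in for a backend PID
nlwp=$(ps -o nlwp= -p "$pid" | tr -d ' ')
echo "process $pid has $nlwp thread(s)"
```

A value greater than 1 would confirm that something (e.g. a loaded library) has made the backend multithreaded.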

#18 Christian Schröder
cs@deriva.de
In reply to: Tom Lane (#17)
Re: (Never?) Kill Postmaster?

Tom Lane wrote:

* The only place internal_flush would call errmsg is here:

ereport(COMMERROR,
(errcode_for_socket_access(),
errmsg("could not send data to client: %m")));

So why is it unable to send data to the client?

The user accesses the database from his home office. He is connected to
the internet using a dial-in connection which seems to be disconnected
from time to time. The problem seems to be specific for this user, so it
may have something to do with his type of connection.

* How the heck would strerror() be blocking on a thread mutex, when
the backend is single-threaded?

We recently discovered that it was possible for pltcl to cause the
backend to become multithreaded:
http://archives.postgresql.org/pgsql-patches/2007-09/msg00194.php
I see from your dump that this session has been using plperl, so I
wonder whether plperl has the same sort of issue. Can you determine
exactly what's been done with plperl in this session?

Not exactly. There are several triggers whose functions are implemented
in perl:

elog(ERROR, 'set_serial needs at least 2 arguments.')
if $_TD->{argc} < 2;

my ($colname, $seqname) = @{$_TD->{args}};
my $result = spi_exec_query("SELECT nextval('$seqname')");
$_TD->{new}{$colname} = $result->{rows}[0]{nextval};
return 'MODIFY';

and

elog(ERROR, 'log_changes needs at least 2 arguments.')
if $_TD->{argc} < 2;

my @args = @{$_TD->{args}};

shift(@args) =~ /([^.]*(?=\.))?\.?(.*)/;
my $log_table = defined($1) ? qq("$1"."$2") : qq("$2");

if ($_TD->{event} eq 'UPDATE') {
my $id = join(',', @{$_TD->{old}}{@args});
$id =~ s/'/\\'/g;
while (my ($key, $value) = each %{$_TD->{old}}) {
if ($value ne $_TD->{new}{$key}) {
my $newvalue = $_TD->{new}{$key};
$value =~ s/'/\\'/g;
$newvalue =~ s/'/\\'/g;
my $query = sprintf(qq(INSERT INTO $log_table VALUES
(current_timestamp, session_user, '%s', '%s', '%s', '%s', '%s')),
$id,
$_TD->{relname},
$key,
$value,
$newvalue);
spi_exec_query($query);
}
}
}
else {
my $id = join(',', @{$_TD->{$_TD->{event} eq 'INSERT' ? 'new' :
'old'}}{@args});
$id =~ s/'/\\'/g;
my $query = sprintf(qq(INSERT INTO $log_table VALUES
(current_timestamp, session_user, '%s', '%s', '(%s)', null, null)),
$id,
$_TD->{relname},
$_TD->{event});
spi_exec_query($query);
}
return;

Also, can you confirm that there is actually more than one thread active
in this process? On Linux "ps -Lw" would show threads.

Since "ps -Lw" only showed me my own processes, I did "ps -AT -o
pid,ppid,lwp,nlwp,command" instead. This is the relevant line:

PID PPID LWP NLWP COMMAND
...
7129 3934 7129 1 postgres: dpyrek uspi 88.70.241.205(10612) SELECT
...

I'm not sure if this data is correct because NLWP (the number of threads
in the process) is 1 for all processes.

Is it necessary to recompile the server with debugging information
enabled? As I already mentioned, this is a production database, so I
would have to do it at night, which would be somewhat uncomfortable
for me ... not to mention the bad feeling that comes with changing a
production database server ...

Regards,
Christian

--
Deriva GmbH Tel.: +49 551 489500-42
Financial IT and Consulting Fax: +49 551 489500-91
Hans-Böckler-Straße 2 http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer

#19 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Christian Schröder (#18)
Re: (Never?) Kill Postmaster?

Christian Schröder wrote:

Since "ps -Lw" only showed me my own processes, I did "ps -AT -o
pid,ppid,lwp,nlwp,command" instead. This is the relevant line:

PID PPID LWP NLWP COMMAND
...
7129 3934 7129 1 postgres: dpyrek uspi 88.70.241.205(10612) SELECT
...

I'm not sure if this data is correct because NLWP (the number of threads in
the process) is 1 for all processes.

Please try "thread apply all bt full" on gdb.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#20 Christian Schröder
cs@deriva.de
In reply to: Alvaro Herrera (#19)
Re: (Never?) Kill Postmaster?

Alvaro Herrera wrote:

Please try "thread apply all bt full" on gdb.

The first lines where the symbols are loaded are of course identical.
The output of the command is in my opinion not very helpful:

(gdb) thread apply all bt full

Thread 1 (Thread 47248855881456 (LWP 7129)):
#0 0x00002af904809a68 in __lll_mutex_lock_wait () from
/lib64/libpthread.so.0
No symbol table info available.
#1 0x00002af904806e88 in pthread_rwlock_rdlock () from
/lib64/libpthread.so.0
No symbol table info available.
#2 0x00002af8fb13de23 in _nl_find_msg () from /lib64/libc.so.6
No symbol table info available.
#3 0x00002af8fb13ec83 in __dcigettext () from /lib64/libc.so.6
No symbol table info available.
#4 0x00002af8fb186f0b in strerror_r () from /lib64/libc.so.6
No symbol table info available.
#5 0x00002af8fb186d33 in strerror () from /lib64/libc.so.6
No symbol table info available.
#6 0x00000000005f4daa in expand_fmt_string ()
No symbol table info available.
#7 0x00000000005f6d14 in errmsg ()
No symbol table info available.
#8 0x00000000005182cc in internal_flush ()
No symbol table info available.
#9 0x00000000005183b6 in internal_putbytes ()
No symbol table info available.
#10 0x000000000051841c in pq_putmessage ()
No symbol table info available.
#11 0x00000000005199c4 in pq_endmessage ()
No symbol table info available.
#12 0x0000000000440c6a in printtup ()
No symbol table info available.
#13 0x00000000004fc1b8 in ExecutorRun ()
No symbol table info available.
#14 0x0000000000580451 in PortalRunSelect ()
No symbol table info available.
#15 0x0000000000581446 in PortalRun ()
No symbol table info available.
#16 0x000000000057d625 in exec_simple_query ()
No symbol table info available.
#17 0x000000000057ea72 in PostgresMain ()
No symbol table info available.
#18 0x0000000000558218 in ServerLoop ()
No symbol table info available.
#19 0x0000000000558db8 in PostmasterMain ()
No symbol table info available.
#20 0x000000000051a213 in main ()
No symbol table info available.
#0 0x00002af904809a68 in __lll_mutex_lock_wait () from
/lib64/libpthread.so.0

What does this tell you?

Regards,
Christian

--
Deriva GmbH Tel.: +49 551 489500-42
Financial IT and Consulting Fax: +49 551 489500-91
Hans-Böckler-Straße 2 http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer

#21 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Christian Schröder (#20)
#22 Christian Schröder
cs@deriva.de
In reply to: Alvaro Herrera (#21)
#23 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christian Schröder (#22)
#24 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christian Schröder (#22)
#25
In reply to: Tom Lane (#23)
#26
In reply to: Tom Lane (#24)
#27 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christian Schröder (#16)
#28 Christian Schröder
cs@deriva.de
In reply to: Tom Lane (#27)
#29 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christian Schröder (#28)
#30 Christian Schröder
cs@deriva.de
In reply to: Tom Lane (#29)
#31 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christian Schröder (#30)
#32 Christian Schröder
cs@deriva.de
In reply to: Tom Lane (#31)
#33 Martijn van Oosterhout
kleptog@svana.org
In reply to: Christian Schröder (#32)
#34 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christian Schröder (#32)
#35
In reply to: Tom Lane (#34)
#36 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stefan Schwarzer (#1)
#37 Christian Schröder
cs@deriva.de
In reply to: Tom Lane (#36)
#38 Martijn van Oosterhout
kleptog@svana.org
In reply to: Christian Schröder (#37)
#39 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christian Schröder (#37)