How to repair Pg 9.1?

Started by Dmitry E. Oboukhovalmost 14 years ago6 messagesgeneral
Jump to latest
#1Dmitry E. Oboukhov
unera@debian.org

Recently I started a hard query (ALTER TABLE) and then Postgresql was
killed by OOM killer.

Since that clients lose their connection and Postgresql says in log:

2012-04-15 16:46:02 MSK LOG: server process (PID 825) was terminated by signal 11: Segmentation fault
2012-04-15 16:46:02 MSK LOG: terminating any other active server processes
2012-04-15 16:46:02 MSK WARNING: terminating connection because of crash of another server process
2012-04-15 16:46:02 MSK DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-04-15 16:46:02 MSK HINT: In a moment you should be able to reconnect to the database and repeat your command.
2012-04-15 16:46:02 MSK WARNING: terminating connection because of crash of another server process
2012-04-15 16:46:02 MSK DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-04-15 16:46:02 MSK HINT: In a moment you should be able to reconnect to the database and repeat your command.
2012-04-15 16:46:02 MSK WARNING: terminating connection because of crash of another server process
2012-04-15 16:46:02 MSK DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-04-15 16:46:02 MSK HINT: In a moment you should be able to reconnect to the database and repeat your command.
2012-04-15 16:46:02 MSK LOG: all server processes terminated; reinitializing
2012-04-15 16:46:02 MSK LOG: database system was interrupted; last known up at 2012-04-15 16:45:17 MSK
2012-04-15 16:46:02 MSK LOG: database system was not properly shut down; automatic recovery in progress
2012-04-15 16:46:02 MSK LOG: consistent recovery state reached at 58/21C44D18
2012-04-15 16:46:02 MSK LOG: redo starts at 58/21C44D18
2012-04-15 16:46:02 MSK LOG: record with zero length at 58/21CA7E20
2012-04-15 16:46:02 MSK LOG: redo done at 58/21CA7DE0
2012-04-15 16:46:02 MSK LOG: last completed transaction was at log time 2012-04-15 16:46:02.620188+04
2012-04-15 16:46:02 MSK LOG: autovacuum launcher started
2012-04-15 16:46:02 MSK LOG: database system is ready to accept connections

What can I do in the situation?

Postgresql 9.1/Debian wheezy
--

. ''`. Dmitry E. Oboukhov
: :’ : email: unera@debian.org jabber://UNera@uvw.ru
`. `~’ GPGKey: 1024D / F8E26537 2006-11-21
`- 1B23 D4F8 8EC0 D902 0555 E438 AB8C 00CF F8E2 6537

#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dmitry E. Oboukhov (#1)
Re: How to repair Pg 9.1?

On 15.4.2012 14:52, Dmitry E. Oboukhov wrote:

Recently I started a hard query (ALTER TABLE) and then Postgresql was
killed by OOM killer.

Since that clients lose their connection and Postgresql says in log:

2012-04-15 16:46:02 MSK LOG: server process (PID 825) was terminated by signal 11: Segmentation fault
2012-04-15 16:46:02 MSK LOG: terminating any other active server processes
2012-04-15 16:46:02 MSK WARNING: terminating connection because of crash of another server process
2012-04-15 16:46:02 MSK DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-04-15 16:46:02 MSK HINT: In a moment you should be able to reconnect to the database and repeat your command.
2012-04-15 16:46:02 MSK WARNING: terminating connection because of crash of another server process
2012-04-15 16:46:02 MSK DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-04-15 16:46:02 MSK HINT: In a moment you should be able to reconnect to the database and repeat your command.
2012-04-15 16:46:02 MSK WARNING: terminating connection because of crash of another server process
2012-04-15 16:46:02 MSK DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-04-15 16:46:02 MSK HINT: In a moment you should be able to reconnect to the database and repeat your command.
2012-04-15 16:46:02 MSK LOG: all server processes terminated; reinitializing
2012-04-15 16:46:02 MSK LOG: database system was interrupted; last known up at 2012-04-15 16:45:17 MSK
2012-04-15 16:46:02 MSK LOG: database system was not properly shut down; automatic recovery in progress
2012-04-15 16:46:02 MSK LOG: consistent recovery state reached at 58/21C44D18
2012-04-15 16:46:02 MSK LOG: redo starts at 58/21C44D18
2012-04-15 16:46:02 MSK LOG: record with zero length at 58/21CA7E20
2012-04-15 16:46:02 MSK LOG: redo done at 58/21CA7DE0
2012-04-15 16:46:02 MSK LOG: last completed transaction was at log time 2012-04-15 16:46:02.620188+04
2012-04-15 16:46:02 MSK LOG: autovacuum launcher started
2012-04-15 16:46:02 MSK LOG: database system is ready to accept connections

What can I do in the situation?

Well, what happened is quite clear. One of the processes was terminated
by the OOM killer. The database noticed this and responded properly by
restarting the database. That caused a proper recovery - you don't need
to "repair" the database.

What you need to do is change the machine configuration so that the OOM
errors do not happen. I don't know what's running on the machine, but
this usually means tweaking the vm.* parameters at the kernel level and
shared_buffers/work_mem/maintenance_work_mem at the PostgreSQL level.

In this particular situation (ALTER TABLE) I'd use a low value for the
maintenance_work_mem parameter - it'll run longer but won't need that
much memory.

Tomas

#3Dmitry E. Oboukhov
unera@debian.org
In reply to: Tomas Vondra (#2)
Re: How to repair Pg 9.1?

2012-04-15 16:46:02 MSK LOG: server process (PID 825) was terminated by signal 11: Segmentation fault

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

What can I do in the situation?

Well, what happened is quite clear. One of the processes was terminated
by the OOM killer. The database noticed this and responded properly by
restarting the database. That caused a proper recovery - you don't need
to "repair" the database.

What you need to do is change the machine configuration so that the OOM
errors do not happen. I don't know what's running on the machine, but
this usually means tweaking the vm.* parameters at the kernel level and
shared_buffers/work_mem/maintenance_work_mem at the PostgreSQL level.

In this particular situation (ALTER TABLE) I'd use a low value for the
maintenance_work_mem parameter - it'll run longer but won't need that
much memory.

Tomas

I've just upgraded my postgresql upto current wheezy version
(9.1.3-2/Debian): I wanted to get backtrace so it was upgraded when I
installed postgresql-dbg package. After upgrading it discontinued
segfault and began work fine.
:)
--

. ''`. Dmitry E. Oboukhov
: :’ : email: unera@debian.org jabber://UNera@uvw.ru
`. `~’ GPGKey: 1024D / F8E26537 2006-11-21
`- 1B23 D4F8 8EC0 D902 0555 E438 AB8C 00CF F8E2 6537

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dmitry E. Oboukhov (#1)
Re: How to repair Pg 9.1?

"Dmitry E. Oboukhov" <unera@debian.org> writes:

Recently I started a hard query (ALTER TABLE) and then Postgresql was
killed by OOM killer.
...
2012-04-15 16:46:02 MSK LOG: server process (PID 825) was terminated by signal 11: Segmentation fault

You sure it was an OOM kill? The OOM killer does its dirty work with
signal 9 (SIGKILL), not signal 11 (SIGSEGV).

regards, tom lane

#5Dmitry E. Oboukhov
unera@debian.org
In reply to: Tom Lane (#4)
Re: How to repair Pg 9.1?

Recently I started a hard query (ALTER TABLE) and then Postgresql was
killed by OOM killer.
...
2012-04-15 16:46:02 MSK LOG: server process (PID 825) was terminated by signal 11: Segmentation fault

You sure it was an OOM kill? The OOM killer does its dirty work with
signal 9 (SIGKILL), not signal 11 (SIGSEGV).

OOM killer was when I did 'ALTER TABLE' (in the first time),
the log was since the time for each client's connecting.

Sorry, my bad English, I could not explain cleanly :)
--

. ''`. Dmitry E. Oboukhov
: :’ : email: unera@debian.org jabber://UNera@uvw.ru
`. `~’ GPGKey: 1024D / F8E26537 2006-11-21
`- 1B23 D4F8 8EC0 D902 0555 E438 AB8C 00CF F8E2 6537

#6Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dmitry E. Oboukhov (#3)
Re: How to repair Pg 9.1?

On 15.4.2012 16:16, Dmitry E. Oboukhov wrote:

2012-04-15 16:46:02 MSK LOG: server process (PID 825) was terminated by signal 11: Segmentation fault

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

What can I do in the situation?

Well, what happened is quite clear. One of the processes was terminated
by the OOM killer. The database noticed this and responded properly by
restarting the database. That caused a proper recovery - you don't need
to "repair" the database.

What you need to do is change the machine configuration so that the OOM
errors do not happen. I don't know what's running on the machine, but
this usually means tweaking the vm.* parameters at the kernel level and
shared_buffers/work_mem/maintenance_work_mem at the PostgreSQL level.

In this particular situation (ALTER TABLE) I'd use a low value for the
maintenance_work_mem parameter - it'll run longer but won't need that
much memory.

Tomas

I've just upgraded my postgresql upto current wheezy version
(9.1.3-2/Debian): I wanted to get backtrace so it was upgraded when I
installed postgresql-dbg package. After upgrading it discontinued
segfault and began work fine.
:)

I haven't noticed it was a segfault - I kinda believed you it actually
was an OOM kill, but that does not cause segfaults. That's a whole
different story ...

It you were getting that whenever you run the ALTER TABLE and it stopped
after upgrading to 9.1.3, then there was something wrong in the version
you've been using before and it got fixed in 9.1.3. Or maybe it still
exists but it does not happen so often.

It'd be nice to know what version were you using and a detailed
description of what you've done - table structure, size, exact ALTER
TABLE command etc.

Tomas