UTC4115FATAL: the database system is in recovery mode

Started by Mathew Samuel · almost 15 years ago · 4 messages · bugs
#1 Mathew Samuel
Mathew.Samuel@entrust.com

Hi,

I see the following error as found in pg.log:
UTC4115FATAL: the database system is in recovery mode

Actually that message was logged repeatedly for about 4 hours according to the logs (I don't have access to the system itself, just the logs).

Leading up to that error were the following in pg.log:
2011-03-28 10:44:06 UTC3609LOG: checkpoints are occurring too frequently (11 seconds apart)
2011-03-28 10:44:06 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:18 UTC3609LOG: checkpoints are occurring too frequently (12 seconds apart)
2011-03-28 10:44:18 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:28 UTC3609LOG: checkpoints are occurring too frequently (10 seconds apart)
2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too frequently (10 seconds apart)
2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration parameter "checkpoint_segments".
2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement timeout
2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze _zamboni.sl_log_1
2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275, it was already committed
2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was terminated by signal 6
2011-03-28 10:44:42 UTC3566LOG: terminating any other active server processes
2011-03-28 10:44:42 UTC13142WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC13142DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC13142HINT: In a moment you should be able to reconnect to the database and repeat your command.
2011-03-28 10:44:42 UTC29834WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC29834DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC29834HINT: In a moment you should be able to reconnect to the database and repeat your command.
2011-03-28 10:44:42 UTC3553WARNING: terminating connection because of crash of another server process
2011-03-28 10:44:42 UTC3553DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
In fact those last 3 lines repeat over and over until "UTC4115FATAL: the database system is in recovery mode" starts being logged, and that FATAL message then repeats for about 4 hours. At that point the system appears to recover.

The checkpoint settings in postgresql.conf are commented out, so I guess the defaults are being used:
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_warning = 30s # 0 is off
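For reference, following the HINT would mean uncommenting and raising that first setting, along these lines (the value 16 is illustrative only, not a tested recommendation):

```conf
# postgresql.conf -- illustrative sketch of what the HINT suggests
checkpoint_segments = 16        # default is 3; each segment is 16MB of WAL
checkpoint_timeout = 5min       # unchanged default
checkpoint_warning = 30s        # keep warning on to confirm the churn subsides
```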
The system where this was seen was running pgsql-8.2.6 on RHEL4.

I'm not sure if this is a known bug (or if it's a bug at all, or something I can address with different configuration), but I thought I'd post here first in case anyone is familiar with this issue and can suggest a possible solution. Any ideas?

Cheers,
Matt

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Mathew Samuel (#1)
Re: UTC4115FATAL: the database system is in recovery mode

On Fri, May 27, 2011 at 2:26 PM, Mathew Samuel
<Mathew.Samuel@entrust.com> wrote:

> 2011-03-28 10:44:42 UTC3932ERROR:  canceling statement due to statement timeout
> 2011-03-28 10:44:42 UTC3932STATEMENT:  vacuum full analyze _zamboni.sl_log_1
> 2011-03-28 10:44:42 UTC3932PANIC:  cannot abort transaction 1827110275, it
> was already committed
>
> I'm not sure if this is a known bug (or if it's a bug at all, or something I
> can address with different configuration), but I thought I'd post here first
> in case anyone is familiar with this issue and can suggest a possible
> solution. Any ideas?

Not really, although I've seen a few other crashes attributable to
trying to cancel VACUUM FULL. As of PostgreSQL 9.0, VACUUM FULL has
been completely rewritten, so whatever bug is causing this problem is
almost certainly gone in more recent versions. It would be nice to
track down and fix whatever is wrong in the older branches, but we'd
probably need a reproducible test case of some sort. I don't think it
craps out like that EVERY time you interrupt it...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#2)
Re: UTC4115FATAL: the database system is in recovery mode

Robert Haas <robertmhaas@gmail.com> writes:

> On Fri, May 27, 2011 at 2:26 PM, Mathew Samuel
> <Mathew.Samuel@entrust.com> wrote:
>> 2011-03-28 10:44:42 UTC3932PANIC:  cannot abort transaction 1827110275, it
>> was already committed
>>
>> I'm not sure if this is a known bug (or if it's a bug at all, or something
>> I can address with different configuration), but I thought I'd post here
>> first in case anyone is familiar with this issue and can suggest a
>> possible solution. Any ideas?

> Not really, although I've seen a few other crashes attributable to
> trying to cancel VACUUM FULL. As of PostgreSQL 9.0, VACUUM FULL has
> been completely rewritten, so whatever bug is causing this problem is
> almost certainly gone in more recent versions.

That was in fact one of the primary motivations for the rewrite.

> It would be nice to track down and fix whatever is wrong in the older
> branches, but we'd probably need a reproducible test case of some sort. I
> don't think it craps out like that EVERY time you interrupt it...

It does if you interrupt it after the midstream commit :-(. See around
line 2800 in commands/vacuum.c in 8.4, for example, and note that the
lame HOLD_INTERRUPTS hack isn't in the version the OP is running.

regards, tom lane

#4 Craig Ringer
craig@2ndquadrant.com
In reply to: Mathew Samuel (#1)
Re: UTC4115FATAL: the database system is in recovery mode

On 05/28/2011 02:26 AM, Mathew Samuel wrote:

> The system where this was seen was running pgsql-8.2.6 on RHEL4.

Upgrade. A workaround is in newer versions of 8.2.
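Until the upgrade happens, a possible stopgap is to keep the timeout cancel from ever firing during that maintenance command; an untested sketch (table name taken from the log above):

```sql
-- untested sketch: disable the per-session statement timeout around the
-- maintenance command so a timeout cancel cannot arrive mid-VACUUM
SET statement_timeout = 0;
VACUUM FULL ANALYZE _zamboni.sl_log_1;
RESET statement_timeout;
```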

--
Craig Ringer