"PANIC: cannot make new WAL entries during recovery" in the wild
Today we got a report in the spanish list about the message in $subject.
The server is 8.4 running on Windows.
Any ideas? I'm wondering what kind of diagnostics we can run to debug
the problem. xlogdump perhaps?
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote:
2009-08-05 11:58:19 COTLOG: el sistema de bases de datos fue
interrumpido durante la recuperaci�n en 2009-08-05 11:12:14 COT
2009-08-05 11:58:19 COTHINT: Esto probablemente significa que
algunos datos est�n corruptos y tendr� que usar el respaldo m�s
reciente para la recuperaci�n.
2009-08-05 11:58:19 COTLOG: el sistema de bases de datos no fue
apagado apropiadamente; se est� efectuando la recuperaci�n
autom�tica
2009-08-05 11:58:19 COTLOG: redo comienza en 0/75334970
2009-08-05 11:58:19 COTLOG: la direcci�n de p�gina 0/6D374000 en el
archivo de registro 0, segmento 117, posici�n 3620864 es inesperada
2009-08-05 11:58:19 COTLOG: redo listo en 0/75370F68
2009-08-05 11:58:19 COTLOG: �ltima transacci�n completada al tiempo
de registro 2009-08-05 09:45:59.796-05
2009-08-05 11:58:19 COTPANIC: cannot make new WAL entries during recovery
After adding %p to the log_line_prefix, it becomes clear that the
process calling XLogInsert here is the startup process.
This is new code in 8.4. Is no one concerned about this?
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes:
This is new code in 8.4. Is no one concerned about this?
[ shrug... ] It's uninvestigatable with only this amount of detail.
How about a test case, or at least a backtrace?
regards, tom lane
2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:
After adding %p to the log_line_prefix, it becomes clear that the
process calling XLogInsert here is the startup process.This is new code in 8.4. Is no one concerned about this?
Can you get a backtrace?
Greg Stark wrote:
2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:
After adding %p to the log_line_prefix, it becomes clear that the
process calling XLogInsert here is the startup process.This is new code in 8.4. �Is no one concerned about this?
Can you get a backtrace?
I'll ask the user.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:
2009-08-05 11:58:19 COTLOG: la dirección de página 0/6D374000 en el
archivo de registro 0, segmento 117, posición 3620864 es inesperada
Incidentally, Google's translate gives me the impression that the
above message corresponds to:
if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
{
ereport(emode,
(errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
readId, readSeg, readOff)));
return false;
}
I'm not sure what this indicates though.
Greg Stark <gsstark@mit.edu> writes:
2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:
2009-08-05 11:58:19 COTLOG: �la direcci�n de p�gina 0/6D374000 en el
archivo de registro 0, segmento 117, posici�n 3620864 es inesperada
Incidentally, Google's translate gives me the impression that the
above message corresponds to:
if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
{
ereport(emode,
(errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
readId, readSeg, readOff)));
return false;
}
I'm not sure what this indicates though.
It probably means nothing: this is a typical termination condition when
reading from a recycled WAL file. You come to a page that doesn't have
the expected page address, because what it's got is whatever it had in
the WAL file's previous cycle of life.
The PANIC is evidently happening during post-WAL-scanning cleanup,
perhaps while we're trying to repair incomplete btree page splits
or some such. But I don't want to speculate too much without a
stack trace.
regards, tom lane
Greg Stark wrote:
2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:
2009-08-05 11:58:19 COTLOG: �la direcci�n de p�gina 0/6D374000 en el
archivo de registro 0, segmento 117, posici�n 3620864 es inesperadaIncidentally, Google's translate gives me the impression that the
above message corresponds to:if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
{
ereport(emode,
(errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
readId, readSeg, readOff)));
return false;
}
Yes, that's correct. (It would be great to have a system to
reverse-translate the messages based on the PO files ...)
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes:
Today we got a report in the spanish list about the message in $subject.
The server is 8.4 running on Windows.
I accidentally managed to reproduce this in HEAD just now, by kill -9'ing
a backend that was in the midst of a COPY IN operation (I was trying to
reproduce Neil Best's unrelated issue...) The server log is
LOG: server process (PID 23846) was terminated by signal 9
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2009-08-07 11:27:36 EDT
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/1B9D7790
LOG: unexpected pageaddr 0/1532E000 in log file 0, segment 28, offset 3334144
LOG: redo done at 0/1C32D200
PANIC: cannot make new WAL entries during recovery
LOG: startup process (PID 23883) was terminated by signal 6
LOG: aborting startup due to startup process failure
and the stack trace of the panic'd startup process looks like
#4 0x4b6e20 in errfinish (dummy=1) at elog.c:503
#5 0x4b86a0 in elog_finish (elevel=1073803952, fmt=0x7b0394b0 "") at elog.c:1142
#6 0x1f722c in XLogInsert (rmid=11 '\013', info=114 'r', rdata=0xc004d07c) at xlog.c:555
#7 0x1df290 in _bt_insertonpg (rel=0x4006cf28, buf=70, stack=0x3, itup=0x4006d150, newitemoff=38,
split_only_page=0) at nbtinsert.c:833
#8 0x1e0898 in _bt_insert_parent (rel=0x4006cf28, buf=304, rbuf=854, stack=0x7b03b9d8, is_root=0, is_only=0)
at nbtinsert.c:1627
#9 0x1ef098 in btree_xlog_cleanup () at nbtxlog.c:927
#10 0x201c44 in StartupXLOG () at xlog.c:5767
#11 0x206134 in StartupProcessMain () at xlog.c:8034
#12 0x228d0c in AuxiliaryProcessMain (argc=2, argv=0x7b03b6d8) at bootstrap.c:433
#13 0x39bb68 in StartChildProcess (type=StartupProcess) at postmaster.c:4243
So that confirms my speculation that btree index cleanup is the source
of the message. We have two basic approaches to dealing with it:
1. Decide that the check added to XLogInsert is wrong and take it out.
2. Arrange for some sort of explicit state transition between the
WAL-reading and cleanup phases of recovery, and make sure XLogInsert
knows about it.
Thoughts?
regards, tom lane
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
Today we got a report in the spanish list about the message in $subject.
The server is 8.4 running on Windows.I accidentally managed to reproduce this in HEAD just now, by kill -9'ing
a backend that was in the midst of a COPY IN operation (I was trying to
reproduce Neil Best's unrelated issue...) The server log is
You're lucky. I once tried to trigger the rm_cleanup() code with
repeated "killall -9 postmaster", but failed. IIRC I just put an abort()
at the right place in the end.
So that confirms my speculation that btree index cleanup is the source
of the message. We have two basic approaches to dealing with it:1. Decide that the check added to XLogInsert is wrong and take it out.
2. Arrange for some sort of explicit state transition between the
WAL-reading and cleanup phases of recovery, and make sure XLogInsert
knows about it.
I'd suggest we temporarily allow XLog insertion by calling
LocalSetXLogInsertAllowed() just before the rm_cleanup() loop, and reset
it with "LocalXLogInsertAllowed = -1" just after the loop. Like we do at
the startup checkpoint. The sanity check still feels very useful to me,
I'd hate to lose it.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Tom Lane wrote:
So that confirms my speculation that btree index cleanup is the source
of the message. We have two basic approaches to dealing with it:1. Decide that the check added to XLogInsert is wrong and take it out.
2. Arrange for some sort of explicit state transition between the
WAL-reading and cleanup phases of recovery, and make sure XLogInsert
knows about it.
I'd suggest we temporarily allow XLog insertion by calling
LocalSetXLogInsertAllowed() just before the rm_cleanup() loop, and reset
it with "LocalXLogInsertAllowed = -1" just after the loop. Like we do at
the startup checkpoint. The sanity check still feels very useful to me,
I'd hate to lose it.
Yeah, that looks like a sane solution. The disturbing thing is that
we didn't catch this sooner.
regards, tom lane