"PANIC: cannot make new WAL entries during recovery" in the wild

Started by Alvaro Herreraalmost 17 years ago11 messageshackers

Jump to latest

Alvaro Herrera

alvherre@2ndquadrant.com

almost 17 years ago

Today we got a report in the spanish list about the message in $subject.
The server is 8.4 running on Windows.

Any ideas? I'm wondering what kind of diagnostics we can run to debug
the problem. xlogdump perhaps?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Alvaro Herrera

alvherre@2ndquadrant.com

almost 17 years ago

In reply to: Alvaro Herrera (#1)

Re: "PANIC: cannot make new WAL entries during recovery" in the wild

Alvaro Herrera wrote:

2009-08-05 11:58:19 COTLOG: el sistema de bases de datos fue
interrumpido durante la recuperaciï¿½n en 2009-08-05 11:12:14 COT
2009-08-05 11:58:19 COTHINT: Esto probablemente significa que
algunos datos estï¿½n corruptos y tendrï¿½ que usar el respaldo mï¿½s
reciente para la recuperaciï¿½n.
2009-08-05 11:58:19 COTLOG: el sistema de bases de datos no fue
apagado apropiadamente; se estï¿½ efectuando la recuperaciï¿½n
automï¿½tica
2009-08-05 11:58:19 COTLOG: redo comienza en 0/75334970
2009-08-05 11:58:19 COTLOG: la direcciï¿½n de pï¿½gina 0/6D374000 en el
archivo de registro 0, segmento 117, posiciï¿½n 3620864 es inesperada
2009-08-05 11:58:19 COTLOG: redo listo en 0/75370F68
2009-08-05 11:58:19 COTLOG: ï¿½ltima transacciï¿½n completada al tiempo
de registro 2009-08-05 09:45:59.796-05
2009-08-05 11:58:19 COTPANIC: cannot make new WAL entries during recovery

After adding %p to the log_line_prefix, it becomes clear that the
process calling XLogInsert here is the startup process.

This is new code in 8.4. Is no one concerned about this?

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Tom Lane

tgl@sss.pgh.pa.us

almost 17 years ago

In reply to: Alvaro Herrera (#2)

Re: "PANIC: cannot make new WAL entries during recovery" in the wild

Alvaro Herrera <alvherre@commandprompt.com> writes:

This is new code in 8.4. Is no one concerned about this?

[ shrug... ] It's uninvestigatable with only this amount of detail.
How about a test case, or at least a backtrace?

regards, tom lane

Bruce Momjian

bruce@momjian.us

almost 17 years ago

In reply to: Alvaro Herrera (#2)

Re: "PANIC: cannot make new WAL entries during recovery" in the wild

2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:

After adding %p to the log_line_prefix, it becomes clear that the
process calling XLogInsert here is the startup process.

This is new code in 8.4. Is no one concerned about this?

Can you get a backtrace?

--
greg
http://mit.edu/~gsstark/resume.pdf

Alvaro Herrera

alvherre@2ndquadrant.com

almost 17 years ago

In reply to: Bruce Momjian (#4)

Re: "PANIC: cannot make new WAL entries during recovery" in the wild

Greg Stark wrote:

2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:

After adding %p to the log_line_prefix, it becomes clear that the
process calling XLogInsert here is the startup process.

This is new code in 8.4. ï¿½Is no one concerned about this?

Can you get a backtrace?

I'll ask the user.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Bruce Momjian

bruce@momjian.us

almost 17 years ago

In reply to: Alvaro Herrera (#2)

Re: "PANIC: cannot make new WAL entries during recovery" in the wild

2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:

2009-08-05 11:58:19 COTLOG: la dirección de página 0/6D374000 en el
archivo de registro 0, segmento 117, posición 3620864 es inesperada

Incidentally, Google's translate gives me the impression that the
above message corresponds to:

if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
{
ereport(emode,
(errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
readId, readSeg, readOff)));
return false;
}

I'm not sure what this indicates though.

--
greg
http://mit.edu/~gsstark/resume.pdf

Tom Lane

tgl@sss.pgh.pa.us

almost 17 years ago

In reply to: Bruce Momjian (#6)

Re: "PANIC: cannot make new WAL entries during recovery" in the wild

Greg Stark <gsstark@mit.edu> writes:

2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:
2009-08-05 11:58:19 COTLOG: �la direcci�n de p�gina 0/6D374000 en el
archivo de registro 0, segmento 117, posici�n 3620864 es inesperada

Incidentally, Google's translate gives me the impression that the
above message corresponds to:

if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
{
ereport(emode,
(errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
readId, readSeg, readOff)));
return false;
}

I'm not sure what this indicates though.

It probably means nothing: this is a typical termination condition when
reading from a recycled WAL file. You come to a page that doesn't have
the expected page address, because what it's got is whatever it had in
the WAL file's previous cycle of life.

The PANIC is evidently happening during post-WAL-scanning cleanup,
perhaps while we're trying to repair incomplete btree page splits
or some such. But I don't want to speculate too much without a
stack trace.

regards, tom lane

Alvaro Herrera

alvherre@2ndquadrant.com

almost 17 years ago

In reply to: Bruce Momjian (#6)

Re: "PANIC: cannot make new WAL entries during recovery" in the wild

Greg Stark wrote:

2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:

2009-08-05 11:58:19 COTLOG: ï¿½la direcciï¿½n de pï¿½gina 0/6D374000 en el
archivo de registro 0, segmento 117, posiciï¿½n 3620864 es inesperada

Incidentally, Google's translate gives me the impression that the
above message corresponds to:

if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
{
ereport(emode,
(errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
readId, readSeg, readOff)));
return false;
}

Yes, that's correct. (It would be great to have a system to
reverse-translate the messages based on the PO files ...)

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Tom Lane

tgl@sss.pgh.pa.us

almost 17 years ago

In reply to: Alvaro Herrera (#1)

Re: "PANIC: cannot make new WAL entries during recovery" in the wild

Alvaro Herrera <alvherre@commandprompt.com> writes:

Today we got a report in the spanish list about the message in $subject.
The server is 8.4 running on Windows.

I accidentally managed to reproduce this in HEAD just now, by kill -9'ing
a backend that was in the midst of a COPY IN operation (I was trying to
reproduce Neil Best's unrelated issue...) The server log is

LOG: server process (PID 23846) was terminated by signal 9
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2009-08-07 11:27:36 EDT
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/1B9D7790
LOG: unexpected pageaddr 0/1532E000 in log file 0, segment 28, offset 3334144
LOG: redo done at 0/1C32D200
PANIC: cannot make new WAL entries during recovery
LOG: startup process (PID 23883) was terminated by signal 6
LOG: aborting startup due to startup process failure

and the stack trace of the panic'd startup process looks like

#4 0x4b6e20 in errfinish (dummy=1) at elog.c:503
#5 0x4b86a0 in elog_finish (elevel=1073803952, fmt=0x7b0394b0 "") at elog.c:1142
#6 0x1f722c in XLogInsert (rmid=11 '\013', info=114 'r', rdata=0xc004d07c) at xlog.c:555
#7 0x1df290 in _bt_insertonpg (rel=0x4006cf28, buf=70, stack=0x3, itup=0x4006d150, newitemoff=38,
split_only_page=0) at nbtinsert.c:833
#8 0x1e0898 in _bt_insert_parent (rel=0x4006cf28, buf=304, rbuf=854, stack=0x7b03b9d8, is_root=0, is_only=0)
at nbtinsert.c:1627
#9 0x1ef098 in btree_xlog_cleanup () at nbtxlog.c:927
#10 0x201c44 in StartupXLOG () at xlog.c:5767
#11 0x206134 in StartupProcessMain () at xlog.c:8034
#12 0x228d0c in AuxiliaryProcessMain (argc=2, argv=0x7b03b6d8) at bootstrap.c:433
#13 0x39bb68 in StartChildProcess (type=StartupProcess) at postmaster.c:4243

So that confirms my speculation that btree index cleanup is the source
of the message. We have two basic approaches to dealing with it:

1. Decide that the check added to XLogInsert is wrong and take it out.

2. Arrange for some sort of explicit state transition between the
WAL-reading and cleanup phases of recovery, and make sure XLogInsert
knows about it.

Thoughts?

regards, tom lane

#10

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

almost 17 years ago

In reply to: Tom Lane (#9)

Re: "PANIC: cannot make new WAL entries during recovery" in the wild

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Today we got a report in the spanish list about the message in $subject.
The server is 8.4 running on Windows.

I accidentally managed to reproduce this in HEAD just now, by kill -9'ing
a backend that was in the midst of a COPY IN operation (I was trying to
reproduce Neil Best's unrelated issue...) The server log is

You're lucky. I once tried to trigger the rm_cleanup() code with
repeated "killall -9 postmaster", but failed. IIRC I just put an abort()
at the right place in the end.

So that confirms my speculation that btree index cleanup is the source
of the message. We have two basic approaches to dealing with it:

1. Decide that the check added to XLogInsert is wrong and take it out.

2. Arrange for some sort of explicit state transition between the
WAL-reading and cleanup phases of recovery, and make sure XLogInsert
knows about it.

I'd suggest we temporarily allow XLog insertion by calling
LocalSetXLogInsertAllowed() just before the rm_cleanup() loop, and reset
it with "LocalXLogInsertAllowed = -1" just after the loop. Like we do at
the startup checkpoint. The sanity check still feels very useful to me,
I'd hate to lose it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11

Tom Lane

tgl@sss.pgh.pa.us

almost 17 years ago

In reply to: Heikki Linnakangas (#10)

Re: "PANIC: cannot make new WAL entries during recovery" in the wild

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Tom Lane wrote:

So that confirms my speculation that btree index cleanup is the source
of the message. We have two basic approaches to dealing with it:

1. Decide that the check added to XLogInsert is wrong and take it out.

2. Arrange for some sort of explicit state transition between the
WAL-reading and cleanup phases of recovery, and make sure XLogInsert
knows about it.

I'd suggest we temporarily allow XLog insertion by calling
LocalSetXLogInsertAllowed() just before the rm_cleanup() loop, and reset
it with "LocalXLogInsertAllowed = -1" just after the loop. Like we do at
the startup checkpoint. The sanity check still feels very useful to me,
I'd hate to lose it.

Yeah, that looks like a sane solution. The disturbing thing is that
we didn't catch this sooner.

regards, tom lane