Replication terminated due to PANIC

Started by Adarsh Sharma, about 13 years ago, 6 messages, general

eddy.adarsh@gmail.com

about 13 years ago

Hi all,

I have a PostgreSQL 9.2 instance running on a CentOS 6.3 box. Yesterday I
set up a hot standby using pg_basebackup. Today I got the alert below
from the standby box:

[1] (from line 412,723)
2013-04-24 23:07:18 UTC [13445]: [6-1] user= db= host= PANIC:
_bt_restore_page: cannot add item to page

When I checked, replication had terminated because the slave DB had shut
down. In the logs I can see the messages below:

2013-04-24 23:17:16 UTC [26989]: [5360083-1] user= db= host= ERROR: could
not open file "global/14078": No such file or directory
2013-04-24 23:17:16 UTC [26989]: [5360084-1] user= db= host= CONTEXT:
writing block 0 of relation global/14078
2013-04-24 23:17:16 UTC [26989]: [5360085-1] user= db= host= WARNING:
could not write block 0 of global/14078
2013-04-24 23:17:16 UTC [26989]: [5360086-1] user= db= host= DETAIL:
Multiple failures --- write error might be permanent.

I checked in the global directory of the master; the directory 14078 doesn't exist there.
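
A check like this can be scripted so that it is easy to repeat on the master and on each standby. The following is a minimal sketch, assuming Python is available on the hosts. The function name `relation_file_exists` and the example data directory path are hypothetical and do not come from this thread; the sketch only tests whether a path such as `global/14078` exists under a given data directory.

```python
import os


def relation_file_exists(data_dir: str, relative_path: str) -> bool:
    """Return True if relative_path (e.g. "global/14078") exists under data_dir."""
    full_path = os.path.join(data_dir, *relative_path.split("/"))
    return os.path.exists(full_path)


if __name__ == "__main__":
    # Hypothetical data directory; substitute the real one on each host.
    data_dir = "/var/lib/pgsql/9.2/data"
    print(relation_file_exists(data_dir, "global/14078"))
```

Running the same check against every data directory involved makes it clear which instance actually has the file the checkpoint is trying to write.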

Has anyone faced this issue?

Thanks

Sergey Konoplev

gray.ru@gmail.com

about 13 years ago

In reply to: Adarsh Sharma (#1)

Re: Replication terminated due to PANIC

On Wed, Apr 24, 2013 at 5:05 PM, Adarsh Sharma <eddy.adarsh@gmail.com> wrote:

> I have a PostgreSQL 9.2 instance running on a CentOS 6.3 box. Yesterday I
> set up a hot standby using pg_basebackup. Today I got the alert below from
> the standby box:
>
> [1] (from line 412,723)
> 2013-04-24 23:07:18 UTC [13445]: [6-1] user= db= host= PANIC:
> _bt_restore_page: cannot add item to page
>
> When I checked, replication had terminated because the slave DB had shut
> down. In the logs I can see the messages below:

I am not sure this matches your situation, but take a look at this thread:

/messages/by-id/CAL_0b1t=WuM6roO8dki=w8DhH8P8whhohbPjReymmQUrOcNT2A@mail.gmail.com

There is a patch by Andres Freund at the end of the discussion. Three
weeks have passed since I installed the patched version, and it looks
like the patch fixed my issue.

> 2013-04-24 23:17:16 UTC [26989]: [5360083-1] user= db= host= ERROR: could
> not open file "global/14078": No such file or directory
> 2013-04-24 23:17:16 UTC [26989]: [5360084-1] user= db= host= CONTEXT:
> writing block 0 of relation global/14078
> 2013-04-24 23:17:16 UTC [26989]: [5360085-1] user= db= host= WARNING: could
> not write block 0 of global/14078
> 2013-04-24 23:17:16 UTC [26989]: [5360086-1] user= db= host= DETAIL:
> Multiple failures --- write error might be permanent.
>
> I checked in the global directory of the master; the directory 14078 doesn't exist there.
>
> Has anyone faced this issue?
>
> Thanks

--
Kind regards,
Sergey Konoplev
Database and Software Consultant

Profile: http://www.linkedin.com/in/grayhemp
Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
Skype: gray-hemp
Jabber: gray.ru@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Adarsh Sharma

eddy.adarsh@gmail.com

about 13 years ago

In reply to: Sergey Konoplev (#2)

Re: Replication terminated due to PANIC

Thanks, Sergey, for such a quick response, but I don't think this is a
patch-related problem, because we have other DB servers running fine on the
same version, and the message is also different:

host= PANIC: _bt_restore_page: cannot add item to page

Replication works fine all day, but at midnight, when the log rotates, it
shows the messages below:

2013-04-24 00:00:00 UTC [26989]: [4945032-1] user= db= host= LOG:
checkpoint starting: time
2013-04-24 00:00:00 UTC [26989]: [4945033-1] user= db= host= ERROR:
could not open file "global/14078": No such file or directory
2013-04-24 00:00:00 UTC [26989]: [4945034-1] user= db= host= CONTEXT:
writing block 0 of relation global/14078
2013-04-24 00:00:00 UTC [26989]: [4945035-1] user= db= host= WARNING:
could not write block 0 of global/14078
2013-04-24 00:00:00 UTC [26989]: [4945036-1] user= db= host= DETAIL:
Multiple failures --- write error might be permanent.
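
To see how often these entries recur around each checkpoint, the log can be filtered by severity. The following is a minimal sketch, assuming every entry follows the prefix layout visible in the lines above (timestamp, `[pid]`, `[line-1]`, `user= db= host=`, then the severity) and that entries wrapped by the mail client have been re-joined onto one line. The names `LOG_LINE`, `LogEntry`, `parse_line` and `problem_entries` are hypothetical helpers, not part of PostgreSQL.

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Tuple

# Matches entries shaped like the ones quoted in this thread, for example:
# 2013-04-24 00:00:00 UTC [26989]: [4945033-1] user= db= host= ERROR: ...
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \w+) "
    r"\[(?P<pid>\d+)\]: \[(?P<line_no>\d+)-\d+\] "
    r"user=(?P<user>\S*) db=(?P<db>\S*) host=(?P<host>\S*) "
    r"(?P<severity>[A-Z]+):\s*(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    pid: int
    severity: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one re-joined log entry; return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return LogEntry(
        timestamp=match.group("timestamp"),
        pid=int(match.group("pid")),
        severity=match.group("severity"),
        message=match.group("message"),
    )


def problem_entries(
    lines: Iterable[str],
    severities: Tuple[str, ...] = ("ERROR", "FATAL", "PANIC"),
) -> List[LogEntry]:
    """Keep only the entries whose severity is in the given set."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry is not None and entry.severity in severities]


if __name__ == "__main__":
    sample = [
        "2013-04-24 00:00:00 UTC [26989]: [4945032-1] user= db= host= LOG: checkpoint starting: time",
        '2013-04-24 00:00:00 UTC [26989]: [4945033-1] user= db= host= ERROR: could not open file "global/14078": No such file or directory',
    ]
    for entry in problem_entries(sample):
        print(entry.timestamp, entry.pid, entry.severity, entry.message)
```

Grouping the result by `pid` shows whether one process (26989 here) is responsible for all of the failures.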

Looks like some index corruption.

Thanks

Lonni J Friedman

netllama@gmail.com

about 13 years ago

In reply to: Adarsh Sharma (#3)

Re: Replication terminated due to PANIC

If it's really index corruption, then you should be able to fix it by
reindexing. However, that doesn't explain what caused the corruption.
Perhaps your hardware is bad in some way?

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman netllama@gmail.com
LlamaLand https://netllama.linux-sxs.org

Andres Freund

andres@anarazel.de

about 13 years ago

In reply to: Sergey Konoplev (#2)

Re: Replication terminated due to PANIC

On 2013-04-24 19:44:25 -0700, Sergey Konoplev wrote:

> On Wed, Apr 24, 2013 at 5:05 PM, Adarsh Sharma <eddy.adarsh@gmail.com> wrote:
>
> > I have a PostgreSQL 9.2 instance running on a CentOS 6.3 box. Yesterday I
> > set up a hot standby using pg_basebackup. Today I got the alert below from
> > the standby box:
> >
> > [1] (from line 412,723)
> > 2013-04-24 23:07:18 UTC [13445]: [6-1] user= db= host= PANIC:
> > _bt_restore_page: cannot add item to page
> >
> > When I checked, replication had terminated because the slave DB had shut
> > down. In the logs I can see the messages below:

Does the global/14078 file exist on the primary? What exact command line
were you using to restore? Which exact version of postgres?

> I am not sure this matches your situation, but take a look at this thread:
>
> /messages/by-id/CAL_0b1t=WuM6roO8dki=w8DhH8P8whhohbPjReymmQUrOcNT2A@mail.gmail.com
>
> There is a patch by Andres Freund at the end of the discussion.

The issues don't look related.

> Three
> weeks have passed since I installed the patched version, and it looks
> like the patch fixed my issue.

Oh, cool! Thanks for verifying.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Adarsh Sharma

eddy.adarsh@gmail.com

about 13 years ago

In reply to: Andres Freund (#5)

Re: Replication terminated due to PANIC

Sorry, my bad, I didn't mention the full DB version:

9.2.4.8 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2
20080704 (Red Hat 4.1.2-52), 64-bit

Apart from that, I am happy to report that the issue is fixed now.
There are two slave setups on the standby box, on different ports, and
there were two stale processes (logger and writer) running with
different parent IDs on the box. After killing those processes and
reloading the conf file, the DB server is replaying logs properly.
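
The stale-process situation described above can be spotted by comparing each postgres child's parent PID with the postmaster PIDs that are actually live. The following is a minimal sketch under several assumptions: child processes show a command title beginning with `postgres:` (as in `postgres: logger process`), the first line of `postmaster.pid` in each data directory holds that instance's postmaster PID, and the process table has already been collected (for example from `ps`). `Proc`, `read_postmaster_pid` and `find_stale_children` are hypothetical names.

```python
import os
from typing import Iterable, List, NamedTuple, Set


class Proc(NamedTuple):
    pid: int
    ppid: int
    command: str


def read_postmaster_pid(data_dir: str) -> int:
    """Read the postmaster PID from the first line of postmaster.pid."""
    with open(os.path.join(data_dir, "postmaster.pid")) as handle:
        return int(handle.readline().strip())


def find_stale_children(procs: Iterable[Proc], postmaster_pids: Set[int]) -> List[Proc]:
    """Return postgres child processes whose parent is not a known postmaster."""
    return [
        proc
        for proc in procs
        if proc.command.startswith("postgres:") and proc.ppid not in postmaster_pids
    ]


if __name__ == "__main__":
    # Hypothetical process table; the PIDs are made up for illustration.
    table = [
        Proc(4242, 1, "/usr/bin/postgres -D /data/standby1"),
        Proc(4250, 4242, "postgres: writer process"),
        Proc(999, 1, "postgres: logger process"),
    ]
    for proc in find_stale_children(table, {4242}):
        print(proc.pid, proc.ppid, proc.command)
```

Anything the sketch reports should be inspected by hand before it is killed, since a title match alone does not prove that a process is stale.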

@Andres: No, the directory doesn't exist on the master, but it exists on
the other standby.

@Lonni: I was guessing because of the message below in the logs:
_bt_restore_page: cannot add item to page

http://en.verysource.com/code/5191515_1/nbtxlog.c.html

Yes, we faced hardware issues on the master, so we switched over to the
slave and set up a new SR, which is where we are facing this issue.

I still don't know why this PANIC message appeared. Anyway, thank you all
for giving your valuable time to this.

Thanks
