postgres 9.0.2 replicated database is crashing

Started by akp geekabout 13 years ago7 messagesgeneral

akpgeek@gmail.com

about 13 years ago

Hi all -

I am in desperate need of your help. The replication/streaming
stopped working on March5th. I followed the following procedure to restart
the streaming. After running it for couple of hours , the database is
crashing on the slave. This is on our production server. Thanks for your
help.

Steps that I followed

1. SELECT pg_start_backup('label', true);
2. I created tar of data.
3. un tar the data on the slave.
4. deleted the pid file on the slave
5. created recovery.conf
5. SELECT pg_stop_backup(); on the primary
6. started the postgres on slave
7. After 2 hours, the slave server crashes

Please help.

Regards

Lonni J Friedman

netllama@gmail.com

about 13 years ago

In reply to: akp geek (#1)

Re: postgres 9.0.2 replicated database is crashing

That process merely sets up a new server, it doesn't start streaming,
unless the server has been configured correctly. You state that the
slave crashed after two hours. How did you make this determination?

All you seem to be doing is setting up the slave from scratch
repeatedly, and assuming that it will magically just work, rather than
understanding & debugging why its not working. Where is the log
output from your servers that shows what is really transpiring?

On Sat, Mar 9, 2013 at 6:51 AM, akp geek <akpgeek@gmail.com> wrote:

Hi all -

I am in desperate need of your help. The replication/streaming
stopped working on March5th. I followed the following procedure to restart
the streaming. After running it for couple of hours , the database is
crashing on the slave. This is on our production server. Thanks for your
help.

Steps that I followed

1. SELECT pg_start_backup('label', true);
2. I created tar of data.
3. un tar the data on the slave.
4. deleted the pid file on the slave
5. created recovery.conf
5. SELECT pg_stop_backup(); on the primary
6. started the postgres on slave
7. After 2 hours, the slave server crashes

Please help.

Regards

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

akp geek

akpgeek@gmail.com

about 13 years ago

In reply to: Lonni J Friedman (#2)

Re: postgres 9.0.2 replicated database is crashing

thank you. As you mentioned, I understood that I am starting the streaming
scratch which is not what I wanted to do.

Here is what I am planning to .

Our replication process was down since March5th.

1. Is it Ok to get all wals from March5th till now to standby pg_xlog
2. take pg_basebackup
3. export the data dir
4. stop backup
5. restart the standby.

Based on my understanding it should work, because when standby started, it
is not able find the files, as I have deleted from the slave them when I
tried to fix the issue

00000 2013-03-09 04:56:08 GMT LOG: entering standby mode
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E1
00000 2013-03-09 04:56:08 GMT LOG: redo starts at 106/E1000020
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E2
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E3
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E4
00000 2013-03-09 04:56:10 GMT LOG: consistent recovery state reached at
106/E45AD4A8
00000 2013-03-09 04:56:10 GMT LOG: database system is ready to accept
read only connections
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E5
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E6
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E7
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E8
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E9
cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
00000 2013-03-09 04:56:12 GMT LOG: invalid record length at 106/EA10B8C0
cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
00000 2013-03-09 04:56:12 GMT LOG: streaming replication successfully
connected to primary
[unknown] 00000 2013-03-09 04:57:00 GMT [unknown]LOG: connection
received: host=[local]
[unknown] 00000 2013-03-09 04:57:03 GMT [unknown]LOG: connection
received: host=[local]

[unknown] 10.155.253.43(51257) 00000 SELECT 2013-03-09 07:07:18 GMT
prodLog: duration: 6316.649 ms
[unknown] 10.155.253.43(51257) 00000 idle 2013-03-09 07:47:53 GMT prodLog:
disconnection: session time: 0:41:06.529 user=postgres database=fprod host=
10.155.253.43 port=51257
00000 2013-03-09 07:55:48 GMT LOG: restartpoint starting: time
00000 2013-03-09 08:25:47 GMT LOG: restartpoint complete: wrote 19419
buffers (7.4%); write=1799.792 s, sync=0.066 s, total=1799.867 s
00000 2013-03-09 08:25:47 GMT LOG: recovery restart point at
107/FB01B238
00000 2013-03-09 08:25:47 GMT DETAIL: last completed transaction was at
log time 2013-03-09 08:25:41.85776+00
00000 2013-03-09 08:55:48 GMT LOG: restartpoint starting: time
psql [local] 08P01 idle 2013-03-09 09:10:52 GMT prodLog: unexpected EOF on
client connection
psql [local] 00000 idle 2013-03-09 09:10:52 GMT prodLog: disconnection:
session time: 2:15:06.351 user=postgres database=fprod host=[local]
XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index page
in 100924/100937/1225845
XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item, node:
100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata: F isleaf F
isdelete
F updateBlkno:36483
00000 2013-03-09 09:23:46 GMT LOG: startup process (PID 3880) exited
with exit code 1
00000 2013-03-09 09:23:46 GMT LOG: terminating any other active server
processes

thanks

On Sat, Mar 9, 2013 at 11:16 AM, Lonni J Friedman <netllama@gmail.com>wrote:

Show quoted text

That process merely sets up a new server, it doesn't start streaming,
unless the server has been configured correctly. You state that the
slave crashed after two hours. How did you make this determination?

All you seem to be doing is setting up the slave from scratch
repeatedly, and assuming that it will magically just work, rather than
understanding & debugging why its not working. Where is the log
output from your servers that shows what is really transpiring?

On Sat, Mar 9, 2013 at 6:51 AM, akp geek <akpgeek@gmail.com> wrote:

Hi all -

I am in desperate need of your help. The replication/streaming
stopped working on March5th. I followed the following procedure to

restart

the streaming. After running it for couple of hours , the database is
crashing on the slave. This is on our production server. Thanks for

your

help.

Steps that I followed

1. SELECT pg_start_backup('label', true);
2. I created tar of data.
3. un tar the data on the slave.
4. deleted the pid file on the slave
5. created recovery.conf
5. SELECT pg_stop_backup(); on the primary
6. started the postgres on slave
7. After 2 hours, the slave server crashes

Please help.

Regards

Lonni J Friedman

netllama@gmail.com

about 13 years ago

In reply to: akp geek (#3)

Re: postgres 9.0.2 replicated database is crashing

On Sat, Mar 9, 2013 at 1:51 PM, akp geek <akpgeek@gmail.com> wrote:

thank you. As you mentioned, I understood that I am starting the streaming
scratch which is not what I wanted to do.

Here is what I am planning to .

Our replication process was down since March5th.

1. Is it Ok to get all wals from March5th till now to standby pg_xlog
2. take pg_basebackup
3. export the data dir
4. stop backup
5. restart the standby.

Based on my understanding it should work, because when standby started, it
is not able find the files, as I have deleted from the slave them when I
tried to fix the issue

Clearly it didn't work because you've been having issues for the past
4 days, and you've already tried this approach unsuccessfully. Based
on the log snippets below, it looks like you have multiple problems.
First your slave is definitely missing WAL segments. You should
increase the number of WAL segments that are archived on the master
and ensure that your base backup is including the WAL segments (by
generating it with the -x option). However, that's the least of your
problems at the moment, because it looks like the master is what is
crashing.

00000 2013-03-09 04:56:08 GMT LOG: entering standby mode
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E1
00000 2013-03-09 04:56:08 GMT LOG: redo starts at 106/E1000020
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E2
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E3
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E4
00000 2013-03-09 04:56:10 GMT LOG: consistent recovery state reached at
106/E45AD4A8
00000 2013-03-09 04:56:10 GMT LOG: database system is ready to accept
read only connections
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E5
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E6
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E7
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E8
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E9
cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
00000 2013-03-09 04:56:12 GMT LOG: invalid record length at 106/EA10B8C0
cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
00000 2013-03-09 04:56:12 GMT LOG: streaming replication successfully
connected to primary
[unknown] 00000 2013-03-09 04:57:00 GMT [unknown]LOG: connection
received: host=[local]
[unknown] 00000 2013-03-09 04:57:03 GMT [unknown]LOG: connection
received: host=[local]

[unknown] 10.155.253.43(51257) 00000 SELECT 2013-03-09 07:07:18 GMT prodLog:
duration: 6316.649 ms
[unknown] 10.155.253.43(51257) 00000 idle 2013-03-09 07:47:53 GMT prodLog:
disconnection: session time: 0:41:06.529 user=postgres database=fprod host=
10.155.253.43 port=51257
00000 2013-03-09 07:55:48 GMT LOG: restartpoint starting: time
00000 2013-03-09 08:25:47 GMT LOG: restartpoint complete: wrote 19419
buffers (7.4%); write=1799.792 s, sync=0.066 s, total=1799.867 s
00000 2013-03-09 08:25:47 GMT LOG: recovery restart point at
107/FB01B238
00000 2013-03-09 08:25:47 GMT DETAIL: last completed transaction was at
log time 2013-03-09 08:25:41.85776+00
00000 2013-03-09 08:55:48 GMT LOG: restartpoint starting: time
psql [local] 08P01 idle 2013-03-09 09:10:52 GMT prodLog: unexpected EOF on
client connection
psql [local] 00000 idle 2013-03-09 09:10:52 GMT prodLog: disconnection:
session time: 2:15:06.351 user=postgres database=fprod host=[local]
XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index page in
100924/100937/1225845
XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item, node:
100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata: F isleaf F
isdelete
F updateBlkno:36483
00000 2013-03-09 09:23:46 GMT LOG: startup process (PID 3880) exited
with exit code 1
00000 2013-03-09 09:23:46 GMT LOG: terminating any other active server
processes

The real problem is here:
XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index
page in 100924/100937/1225845
XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item,
node: 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata:
F isleaf F isdelete F updateBlkno:36483

That looks like something is badly broken (maybe data corruption)?
Most of the google hits on that error are associated with GIST
indexes. Are you using GIST indexes?

Are you really (still) using 9.0.4 ? I hope you realize that there
are known data corruption bugs in that version, and that version is
super old at this point. You really need to update to 9.0.12.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

akp geek

akpgeek@gmail.com

about 13 years ago

In reply to: Lonni J Friedman (#4)

Re: postgres 9.0.2 replicated database is crashing

Appreciate your findings. taking your points and doing things now. We can
upgrade. Which version is more stable 9.2 or 9.1

We use GIST indexes quite a bit. and we gis also

I recently compiled postgres 9.2 ..

Regards

On Sat, Mar 9, 2013 at 5:09 PM, Lonni J Friedman <netllama@gmail.com> wrote:

Show quoted text

On Sat, Mar 9, 2013 at 1:51 PM, akp geek <akpgeek@gmail.com> wrote:

thank you. As you mentioned, I understood that I am starting the

streaming

scratch which is not what I wanted to do.

Here is what I am planning to .

Our replication process was down since March5th.

1. Is it Ok to get all wals from March5th till now to standby pg_xlog
2. take pg_basebackup
3. export the data dir
4. stop backup
5. restart the standby.

Based on my understanding it should work, because when standby started,

it

is not able find the files, as I have deleted from the slave them when I
tried to fix the issue

Clearly it didn't work because you've been having issues for the past
4 days, and you've already tried this approach unsuccessfully. Based
on the log snippets below, it looks like you have multiple problems.
First your slave is definitely missing WAL segments. You should
increase the number of WAL segments that are archived on the master
and ensure that your base backup is including the WAL segments (by
generating it with the -x option). However, that's the least of your
problems at the moment, because it looks like the master is what is
crashing.

00000 2013-03-09 04:56:08 GMT LOG: entering standby mode
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E1
00000 2013-03-09 04:56:08 GMT LOG: redo starts at 106/E1000020
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E2
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E3
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E4
00000 2013-03-09 04:56:10 GMT LOG: consistent recovery state reached

at

106/E45AD4A8
00000 2013-03-09 04:56:10 GMT LOG: database system is ready to accept
read only connections
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E5
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E6
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E7
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E8
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E9
cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
00000 2013-03-09 04:56:12 GMT LOG: invalid record length at

106/EA10B8C0

cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
00000 2013-03-09 04:56:12 GMT LOG: streaming replication successfully
connected to primary
[unknown] 00000 2013-03-09 04:57:00 GMT [unknown]LOG: connection
received: host=[local]
[unknown] 00000 2013-03-09 04:57:03 GMT [unknown]LOG: connection
received: host=[local]

[unknown] 10.155.253.43(51257) 00000 SELECT 2013-03-09 07:07:18 GMT

prodLog:

duration: 6316.649 ms
[unknown] 10.155.253.43(51257) 00000 idle 2013-03-09 07:47:53 GMT

prodLog:

disconnection: session time: 0:41:06.529 user=postgres database=fprod

host=

10.155.253.43 port=51257
00000 2013-03-09 07:55:48 GMT LOG: restartpoint starting: time
00000 2013-03-09 08:25:47 GMT LOG: restartpoint complete: wrote 19419
buffers (7.4%); write=1799.792 s, sync=0.066 s, total=1799.867 s
00000 2013-03-09 08:25:47 GMT LOG: recovery restart point at
107/FB01B238
00000 2013-03-09 08:25:47 GMT DETAIL: last completed transaction was

at

log time 2013-03-09 08:25:41.85776+00
00000 2013-03-09 08:55:48 GMT LOG: restartpoint starting: time
psql [local] 08P01 idle 2013-03-09 09:10:52 GMT prodLog: unexpected EOF

on

client connection
psql [local] 00000 idle 2013-03-09 09:10:52 GMT prodLog: disconnection:
session time: 2:15:06.351 user=postgres database=fprod host=[local]
XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index

page in

100924/100937/1225845
XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item, node:
100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata: F isleaf

F

isdelete
F updateBlkno:36483
00000 2013-03-09 09:23:46 GMT LOG: startup process (PID 3880) exited
with exit code 1
00000 2013-03-09 09:23:46 GMT LOG: terminating any other active

server

processes

The real problem is here:
XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index
page in 100924/100937/1225845
XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item,
node: 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata:
F isleaf F isdelete F updateBlkno:36483

That looks like something is badly broken (maybe data corruption)?
Most of the google hits on that error are associated with GIST
indexes. Are you using GIST indexes?

Are you really (still) using 9.0.4 ? I hope you realize that there
are known data corruption bugs in that version, and that version is
super old at this point. You really need to update to 9.0.12.

Lonni J Friedman

netllama@gmail.com

about 13 years ago

In reply to: akp geek (#5)

Re: postgres 9.0.2 replicated database is crashing

On Sat, Mar 9, 2013 at 4:05 PM, akp geek <akpgeek@gmail.com> wrote:

Appreciate your findings. taking your points and doing things now. We can
upgrade. Which version is more stable 9.2 or 9.1

They're two entirely different branches. They should both be equally
stable. However, if you're looking for the most straightforward path
I'd recommend going to 9.0.12. Also be sure to read the release notes
first.

We use GIST indexes quite a bit. and we gis also

I recently compiled postgres 9.2 ..

Regards

On Sat, Mar 9, 2013 at 5:09 PM, Lonni J Friedman <netllama@gmail.com> wrote:

On Sat, Mar 9, 2013 at 1:51 PM, akp geek <akpgeek@gmail.com> wrote:

thank you. As you mentioned, I understood that I am starting the
streaming
scratch which is not what I wanted to do.

Here is what I am planning to .

Our replication process was down since March5th.

1. Is it Ok to get all wals from March5th till now to standby pg_xlog
2. take pg_basebackup
3. export the data dir
4. stop backup
5. restart the standby.

Based on my understanding it should work, because when standby started,
it
is not able find the files, as I have deleted from the slave them when I
tried to fix the issue

Clearly it didn't work because you've been having issues for the past
4 days, and you've already tried this approach unsuccessfully. Based
on the log snippets below, it looks like you have multiple problems.
First your slave is definitely missing WAL segments. You should
increase the number of WAL segments that are archived on the master
and ensure that your base backup is including the WAL segments (by
generating it with the -x option). However, that's the least of your
problems at the moment, because it looks like the master is what is
crashing.

00000 2013-03-09 04:56:08 GMT LOG: entering standby mode
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E1
00000 2013-03-09 04:56:08 GMT LOG: redo starts at 106/E1000020
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E2
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E3
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E4
00000 2013-03-09 04:56:10 GMT LOG: consistent recovery state reached
at
106/E45AD4A8
00000 2013-03-09 04:56:10 GMT LOG: database system is ready to
accept
read only connections
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E5
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E6
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E7
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E8
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E9
cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
00000 2013-03-09 04:56:12 GMT LOG: invalid record length at
106/EA10B8C0
cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
00000 2013-03-09 04:56:12 GMT LOG: streaming replication
successfully
connected to primary
[unknown] 00000 2013-03-09 04:57:00 GMT [unknown]LOG: connection
received: host=[local]
[unknown] 00000 2013-03-09 04:57:03 GMT [unknown]LOG: connection
received: host=[local]

[unknown] 10.155.253.43(51257) 00000 SELECT 2013-03-09 07:07:18 GMT
prodLog:
duration: 6316.649 ms
[unknown] 10.155.253.43(51257) 00000 idle 2013-03-09 07:47:53 GMT
prodLog:
disconnection: session time: 0:41:06.529 user=postgres database=fprod
host=
10.155.253.43 port=51257
00000 2013-03-09 07:55:48 GMT LOG: restartpoint starting: time
00000 2013-03-09 08:25:47 GMT LOG: restartpoint complete: wrote
19419
buffers (7.4%); write=1799.792 s, sync=0.066 s, total=1799.867 s
00000 2013-03-09 08:25:47 GMT LOG: recovery restart point at
107/FB01B238
00000 2013-03-09 08:25:47 GMT DETAIL: last completed transaction was
at
log time 2013-03-09 08:25:41.85776+00
00000 2013-03-09 08:55:48 GMT LOG: restartpoint starting: time
psql [local] 08P01 idle 2013-03-09 09:10:52 GMT prodLog: unexpected EOF
on
client connection
psql [local] 00000 idle 2013-03-09 09:10:52 GMT prodLog: disconnection:
session time: 2:15:06.351 user=postgres database=fprod host=[local]
XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index
page in
100924/100937/1225845
XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item, node:
100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata: F isleaf
F
isdelete
F updateBlkno:36483
00000 2013-03-09 09:23:46 GMT LOG: startup process (PID 3880) exited
with exit code 1
00000 2013-03-09 09:23:46 GMT LOG: terminating any other active
server
processes

The real problem is here:
XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index
page in 100924/100937/1225845
XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item,
node: 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata:
F isleaf F isdelete F updateBlkno:36483

That looks like something is badly broken (maybe data corruption)?
Most of the google hits on that error are associated with GIST
indexes. Are you using GIST indexes?

Are you really (still) using 9.0.4 ? I hope you realize that there
are known data corruption bugs in that version, and that version is
super old at this point. You really need to update to 9.0.12.

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman netllama@gmail.com
LlamaLand https://netllama.linux-sxs.org

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

akp geek

akpgeek@gmail.com

about 13 years ago

In reply to: Lonni J Friedman (#6)

Re: postgres 9.0.2 replicated database is crashing

thanks a lot. One final question on this.

if the streaming fails as in our case, we missed 4 days worth of data. We
archive the log files on the primary. if we were to restart , what should
be starting the point ?

Do we need to get all the archived logs to the pg_xlogs from the date ( in
our case 03/05 ) . This is where I got confused.

Again thanks a lot. Now we will definitely upgrade

Regards

On Sat, Mar 9, 2013 at 7:11 PM, Lonni J Friedman <netllama@gmail.com> wrote:

Show quoted text

On Sat, Mar 9, 2013 at 4:05 PM, akp geek <akpgeek@gmail.com> wrote:

Appreciate your findings. taking your points and doing things now. We

can

upgrade. Which version is more stable 9.2 or 9.1

They're two entirely different branches. They should both be equally
stable. However, if you're looking for the most straightforward path
I'd recommend going to 9.0.12. Also be sure to read the release notes
first.

We use GIST indexes quite a bit. and we gis also

I recently compiled postgres 9.2 ..

Regards

On Sat, Mar 9, 2013 at 5:09 PM, Lonni J Friedman <netllama@gmail.com>

wrote:

On Sat, Mar 9, 2013 at 1:51 PM, akp geek <akpgeek@gmail.com> wrote:

thank you. As you mentioned, I understood that I am starting the
streaming
scratch which is not what I wanted to do.

Here is what I am planning to .

Our replication process was down since March5th.

1. Is it Ok to get all wals from March5th till now to standby pg_xlog
2. take pg_basebackup
3. export the data dir
4. stop backup
5. restart the standby.

Based on my understanding it should work, because when standby

started,

it
is not able find the files, as I have deleted from the slave them

when I

tried to fix the issue

Clearly it didn't work because you've been having issues for the past
4 days, and you've already tried this approach unsuccessfully. Based
on the log snippets below, it looks like you have multiple problems.
First your slave is definitely missing WAL segments. You should
increase the number of WAL segments that are archived on the master
and ensure that your base backup is including the WAL segments (by
generating it with the -x option). However, that's the least of your
problems at the moment, because it looks like the master is what is
crashing.

00000 2013-03-09 04:56:08 GMT LOG: entering standby mode
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E1
00000 2013-03-09 04:56:08 GMT LOG: redo starts at 106/E1000020
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E2
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E3
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E4
00000 2013-03-09 04:56:10 GMT LOG: consistent recovery state

reached

at
106/E45AD4A8
00000 2013-03-09 04:56:10 GMT LOG: database system is ready to
accept
read only connections
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E5
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E6
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E7
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E8
cp: cannot access /backup/9.0.4/archive/0000000100000106000000E9
cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
00000 2013-03-09 04:56:12 GMT LOG: invalid record length at
106/EA10B8C0
cp: cannot access /backup/9.0.4/archive/0000000100000106000000EA
00000 2013-03-09 04:56:12 GMT LOG: streaming replication
successfully
connected to primary
[unknown] 00000 2013-03-09 04:57:00 GMT [unknown]LOG: connection
received: host=[local]
[unknown] 00000 2013-03-09 04:57:03 GMT [unknown]LOG: connection
received: host=[local]

[unknown] 10.155.253.43(51257) 00000 SELECT 2013-03-09 07:07:18 GMT
prodLog:
duration: 6316.649 ms
[unknown] 10.155.253.43(51257) 00000 idle 2013-03-09 07:47:53 GMT
prodLog:
disconnection: session time: 0:41:06.529 user=postgres database=fprod
host=
10.155.253.43 port=51257
00000 2013-03-09 07:55:48 GMT LOG: restartpoint starting: time
00000 2013-03-09 08:25:47 GMT LOG: restartpoint complete: wrote
19419
buffers (7.4%); write=1799.792 s, sync=0.066 s, total=1799.867 s
00000 2013-03-09 08:25:47 GMT LOG: recovery restart point at
107/FB01B238
00000 2013-03-09 08:25:47 GMT DETAIL: last completed transaction

was

at
log time 2013-03-09 08:25:41.85776+00
00000 2013-03-09 08:55:48 GMT LOG: restartpoint starting: time
psql [local] 08P01 idle 2013-03-09 09:10:52 GMT prodLog: unexpected

EOF

on
client connection
psql [local] 00000 idle 2013-03-09 09:10:52 GMT prodLog:

disconnection:

session time: 2:15:06.351 user=postgres database=fprod host=[local]
XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index
page in
100924/100937/1225845
XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item,

node:

100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata: F

isleaf

F
isdelete
F updateBlkno:36483
00000 2013-03-09 09:23:46 GMT LOG: startup process (PID 3880)

exited

with exit code 1
00000 2013-03-09 09:23:46 GMT LOG: terminating any other active
server
processes

The real problem is here:
XX000 2013-03-09 09:23:46 GMT FATAL: failed to add item to index
page in 100924/100937/1225845
XX000 2013-03-09 09:23:46 GMT CONTEXT: xlog redo Insert item,
node: 100924/100937/1225845 blkno: 72666 offset: 234 nitem: 1 isdata:
F isleaf F isdelete F updateBlkno:36483

That looks like something is badly broken (maybe data corruption)?
Most of the google hits on that error are associated with GIST
indexes. Are you using GIST indexes?

Are you really (still) using 9.0.4 ? I hope you realize that there
are known data corruption bugs in that version, and that version is
super old at this point. You really need to update to 9.0.12.

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman netllama@gmail.com
LlamaLand https://netllama.linux-sxs.org