replication stops working

Started by John DeSoi almost 13 years ago | 3 messages | general
#1 John DeSoi
desoi@pgedit.com

I have a 9.2 hot standby setup with replication via rsync. For the second time, it has stopped working with no apparent error on the primary or standby. Last time this happened I fixed it by restarting the primary. Yesterday I started a new base backup around noon and it replicated without any problems for about 12 hours. Then it just stopped and I don't see any errors in the Postgres log (primary or standby). I looked at other system logs and still don't see any problems.

I'm running Postgres 9.2.4 on CentOS 6.4. Thanks for any ideas or debug suggestions.

John DeSoi, Ph.D.

=====

wal_level = hot_standby
wal_keep_segments = 48
max_wal_senders = 2

archive_mode = on
archive_command = 'rsync --whole-file --ignore-existing --delete-after -a %p bak-postgres:/pgbackup/%f'
archive_timeout = 300

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2 Daniel Serodio (lists)
daniel.lists@mandic.com.br
In reply to: John DeSoi (#1)
Re: replication stops working

If there are no errors in the log, how did you conclude that replication
has stopped working? Since you're using a hot standby, you've also set up
streaming replication in addition to the WAL archiving, correct?

Regards,
Daniel Serodio

#3 John DeSoi
jdesoi@gmail.com
In reply to: Daniel Serodio (lists) (#2)
Re: replication stops working

On Jul 8, 2013, at 5:41 PM, Daniel Serodio (lists) <daniel.lists@mandic.com.br> wrote:

If there are no errors in the log, how did you conclude that replication has stopped working? Since you're using a hot standby, you've also set up streaming replication in addition to the WAL archiving, correct?

I have an external process that calls pg_last_xact_replay_timestamp and sends an alert if the standby is more than 20 minutes out of sync.
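For readers who want to reproduce this kind of check, here is a minimal sketch, not the actual monitoring process used in this thread. It assumes Python 3.7 or later, a `psql` client on the monitoring host, and a hypothetical `host` argument; only `pg_last_xact_replay_timestamp()` and the 20-minute threshold come from the post. Note that on an idle primary the reported lag may grow even though nothing is wrong, because the function returns the time of the last replayed transaction.

```python
import subprocess
from typing import Optional

# 20 minutes, matching the alert threshold mentioned in the post.
MAX_LAG_SECONDS = 20 * 60

# Seconds since the last replayed transaction; NULL if none was replayed.
LAG_QUERY = "SELECT extract(epoch FROM now() - pg_last_xact_replay_timestamp())"


def parse_lag(psql_output: str) -> Optional[float]:
    """Convert psql's unaligned, tuples-only output to seconds (None for NULL)."""
    text = psql_output.strip()
    return float(text) if text else None


def is_out_of_sync(lag: Optional[float], max_lag: float = MAX_LAG_SECONDS) -> bool:
    """Treat a missing replay timestamp, or a lag above the limit, as a problem."""
    return lag is None or lag > max_lag


def check_standby(host: str = "localhost") -> Optional[str]:
    """Query the standby (hypothetical host) and return an alert text, or None."""
    result = subprocess.run(
        ["psql", "-h", host, "-At", "-c", LAG_QUERY],
        capture_output=True, text=True, check=True, timeout=30,
    )
    lag = parse_lag(result.stdout)
    if is_out_of_sync(lag):
        return f"standby replay lag: {lag} seconds"
    return None
```

A scheduler such as cron could call `check_standby()` every few minutes and send the returned text through whatever alerting channel is available.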

I'm not using streaming replication, just WAL archiving at 5-minute intervals.

I just tried to restart the primary to fix it and it would not shut down. There should not have been any active connections. I finally had to power off the VM.

I think what might be happening is that rsync is hanging when trying to send a WAL file. That might explain no error in the log and difficulty stopping the server. I added a timeout to the archive command; hopefully this will fix it.
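The post does not show how the timeout was added. One possible approach, sketched here under assumptions, is a small wrapper script that runs the same rsync command with a hard time limit and returns a nonzero status when the limit is hit, so that PostgreSQL retries the segment later. The script name, the 120-second limit, and the use of Python 3.5 or later are assumptions for illustration; rsync's own `--timeout` option or the coreutils `timeout` command are alternatives.

```python
import subprocess
import sys

# Assumption: 120 seconds is an illustrative limit, not a value from the thread.
ARCHIVE_TIMEOUT_SECONDS = 120


def run_with_timeout(command, timeout=ARCHIVE_TIMEOUT_SECONDS):
    """Run command; return its exit status, or 1 if it exceeds the timeout."""
    try:
        return subprocess.run(command, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return 1


def archive_wal(wal_path, wal_name):
    """Copy one WAL segment using the rsync options from the original config."""
    command = [
        "rsync", "--whole-file", "--ignore-existing", "--delete-after", "-a",
        wal_path, f"bak-postgres:/pgbackup/{wal_name}",
    ]
    return run_with_timeout(command)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Invoked by PostgreSQL as: archive_wal.py %p %f
    sys.exit(archive_wal(sys.argv[1], sys.argv[2]))
```

With the script saved at a hypothetical path, the setting would look like archive_command = '/usr/local/bin/archive_wal.py %p %f'. A nonzero exit status tells PostgreSQL the segment was not archived, so it keeps the file and tries again.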

John DeSoi, Ph.D.

2013-07-08 21:06:02 EDT [27170]: [1-1] user=main,db=main8,remote=127.0.0.1(62194) FATAL: the database system is shutting down
2013-07-08 21:07:29 EDT [27189]: [1-1] user=postgres,db=postgres,remote=127.0.0.1(62195) FATAL: the database system is shutting down
2013-07-08 21:07:51 EDT [27190]: [1-1] user=postgres,db=postgres,remote=127.0.0.1(62196) FATAL: the database system is shutting down
2013-07-08 21:09:42 EDT [27275]: [1-1] user=postgres,db=postgres,remote=[local] FATAL: the database system is shutting down
2013-07-08 21:11:03 EDT [27363]: [1-1] user=[unknown],db=[unknown],remote=127.0.0.1(62199) LOG: incomplete startup packet
2013-07-08 21:11:03 EDT [27364]: [1-1] user=main,db=main8,remote=127.0.0.1(62200) FATAL: the database system is shutting down
Killed by signal 15.
