replication terminated by primary server

Started by Bruyninckx Kristof over 8 years ago · 3 messages · general
#1 Bruyninckx Kristof
Kristof.Bruyninckx@cegeka.com

In our environment, we have a master/slave replication setup that has been running stably for the last year.
The host systems run Debian Jessie and we are using PostgreSQL 9.4.
Recently the master crashed or hung, and after we restarted the PostgreSQL services on it, replication stopped working. The master itself seems to be running normally, apart from the errors reported when it was restarted. Since then it has logged nothing error-related.

[10192-1] [unknown]@[unknown] LOG: incomplete startup packet
[10222-1] [unknown]@[unknown] LOG: incomplete startup packet
[10033-2] LOG: replication terminated by primary server
[10033-3] DETAIL: End of WAL reached on timeline 2 at 999/A5687790.
[1082-12] LOG: invalid record length at 999/A5687790
[10239-1] LOG: started streaming WAL from primary at 999/A5000000 on timeline 2
[1064-7] LOG: startup process (PID 1082) exited with exit code 1
[1064-8] LOG: terminating any other active server processes
[18749-1] readonly@pal WARNING: terminating connection because of crash of another server process
[25793-1] _readonly@pal WARNING: terminating connection because of crash of another server process
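
A note on the WAL positions in the log above: the standby stops at 999/A5687790 and then restarts streaming at 999/A5000000, which is the start of the WAL segment containing that position. The Python sketch below shows that arithmetic. It assumes the default 16 MB WAL segment size, and the helper names are illustrative, not part of PostgreSQL.

```python
# Assumption: default WAL segment size of 16 MB (a compile-time setting).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn):
    """Split an LSN such as '999/A5687790' into its two hex halves."""
    high, low = lsn.split("/")
    return int(high, 16), int(low, 16)


def wal_segment_name(timeline, lsn):
    """Return the 24-character WAL file name that contains the given LSN."""
    high, low = parse_lsn(lsn)
    segment = low // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (timeline, high, segment)


def segment_start(lsn):
    """Return the LSN at which the segment containing `lsn` begins."""
    high, low = parse_lsn(lsn)
    return "%X/%X" % (high, low - low % WAL_SEGMENT_SIZE)


if __name__ == "__main__":
    # Values taken from the log lines above (timeline 2).
    print(wal_segment_name(2, "999/A5687790"))
    print(segment_start("999/A5687790"))
```

Under these assumptions, the file name printed is the segment to look for in pg_xlog on the master when checking whether the WAL around the failing position is still present.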

Since the crash of the master, I have not been able to get the slave to start replicating again.

I always get the following error messages:

[13247-2] HINT: Future log output will go to log destination "syslog".
[13247-3] LOCATION: PostmasterMain, postmaster.c:1228
[13248-1] LOG: 00000: database system was interrupted while in recovery at log time 2017-12-04 15:10:29 CET
[13248-2] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
[13248-3] LOCATION: StartupXLOG, xlog.c:6134
[13248-4] LOG: 00000: entering standby mode
[13248-5] LOCATION: StartupXLOG, xlog.c:6203
[13247-4] LOG: 00000: startup process (PID 13248) exited with exit code 1
[13247-5] LOCATION: LogChildExit, postmaster.c:3452
[13247-6] LOG: 00000: aborting startup due to startup process failure

I've already tried a complete backup and resync of the slave:
pg_basebackup -D /var/lib/postgresql/backups/fullbackup -R -h <IP> --checkpoint=fast --username=<username> --xlog-method=stream

This completes without any error message. The odd thing is that the backup folder already contains a recovery.done file. When I run the same command on a test platform, no recovery.done is created.
However, the test platform runs 9.5, so I'm not sure whether that is related.

The recovery.conf also contains all the information it should, but the error message stays the same.
cat recovery.conf
recovery_target_timeline='latest'
standby_mode = 'on'
primary_conninfo = 'user=<user> password=<passwd> host=IP port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
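
As a sanity check on a file like the one above, the following Python sketch parses recovery.conf-style `name = 'value'` lines and reports whether the two settings a streaming standby relies on are present. It is only an illustration: the function names are made up, and the parser handles just the simple single-quoted form shown here, not every syntax PostgreSQL accepts.

```python
import re

# Matches simple lines of the form: name = 'value' (spaces around '=' optional).
SETTING = re.compile(r"^\s*(\w+)\s*=\s*'(.*)'\s*$")


def parse_recovery_conf(text):
    """Return a dict of settings found in recovery.conf-style text."""
    settings = {}
    for line in text.splitlines():
        if line.strip().startswith("#"):
            continue  # skip comment lines
        match = SETTING.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def standby_problems(settings):
    """List obvious omissions for a streaming standby (simplistic check)."""
    problems = []
    # Simplification: only the spelling 'on' is treated as enabled here.
    if settings.get("standby_mode") != "on":
        problems.append("standby_mode is not 'on'")
    if not settings.get("primary_conninfo"):
        problems.append("primary_conninfo is missing")
    return problems


if __name__ == "__main__":
    sample = """recovery_target_timeline='latest'
standby_mode = 'on'
primary_conninfo = 'user=<user> password=<passwd> host=IP port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
"""
    print(standby_problems(parse_recovery_conf(sample)))
```

A clean result here only says the file is well-formed; it does not explain why the startup process exits.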

Does this mean that the corruption is on the master and that it needs to be restored to a point before the crash? I'm not sure what I can do to get replication working again.
Any ideas?

Kind Regards,

Kristof

Met vriendelijke groeten / Meilleures salutations / Best regards

Kristof Bruyninckx
System Engineer

#2 Payal Singh
payals1@umbc.edu
In reply to: Bruyninckx Kristof (#1)
Re: replication terminated by primary server

On Wed, Dec 6, 2017 at 8:07 AM, Bruyninckx Kristof <Kristof.Bruyninckx@cegeka.com> wrote:

> In our environment, we have a master/slave replication setup that has been
> running stably for the last year.
>
> The host systems run Debian Jessie and we are using PostgreSQL 9.4.
>
> Recently the master crashed or hung, and after we restarted the PostgreSQL
> services on it, replication stopped working. The master itself seems to be
> running normally, apart from the errors reported when it was restarted.
> Since then it has logged nothing error-related.

You might want to check the master's PostgreSQL logs from during and after
the crash as well. Also, check WAL file progress on the master (select * from
pg_stat_archiver).

--
Payal Singh
Graduate Student
Department of Computer Science and Electrical Engineering
University of Maryland, Baltimore County

#3 Bruyninckx Kristof
Kristof.Bruyninckx@cegeka.com
In reply to: Payal Singh (#2)
RE: replication terminated by primary server

I've been going over the system's log from the time of the crash, but nothing stands out that tells me more about the cause of either the crash or the failure to restart replication. I'm attaching part of the log file to this mail.

Correct me if I'm wrong, but "select * from pg_stat_archiver" relates to WAL archiving, correct? This system currently has archiving switched off. For backups we run scheduled pg_dumps of each database.
So the query didn't give me any output, which I think is normal since archiving is switched off.
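
With archiving off, WAL progress can still be compared directly between the two servers. The Python sketch below computes the byte distance between two WAL positions read by hand, for example from pg_current_xlog_location() on the master and pg_last_xlog_replay_location() on the standby (the 9.4 function names). The helper names are illustrative only, and the positions used are the two that appear in the log in the first message, not live measurements.

```python
def lsn_to_int(lsn):
    """Convert an LSN such as '999/A5687790' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff_bytes(ahead, behind):
    """Return how many bytes `ahead` is past `behind` (negative if it is not)."""
    return lsn_to_int(ahead) - lsn_to_int(behind)


if __name__ == "__main__":
    # Distance between the failing record and the start of its WAL segment.
    print(lsn_diff_bytes("999/A5687790", "999/A5000000"))
```

On the server side, pg_xlog_location_diff() performs the same subtraction in SQL.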

To set up replication we used pg_basebackup with the --xlog-method=stream option. This way pg_basebackup not only copies the data as it is, but also streams the XLOG generated during the base backup to our destination server.
I'm not sure what the problem is, since it appears to be able to perform these actions without reporting any error.

Cheers,

Kristof

Attachments:

db.log (application/octet-stream)