2 node bdr setup gives error in replication slots

Started by Nikhil, almost 10 years ago, 2 messages, general
#1Nikhil
nikhilsmenon@gmail.com

I am getting the error below in my 2-node BDR setup, and Postgres is going
down. Any idea?

<35382016-06-07 10:16:59 GMT%LOG: database system was interrupted; last
known up at 2016-06-07 09:06:44 GMT
<35382016-06-07 10:16:59 GMT%PANIC: replication slot file
"pg_replslot/bdr_16389_6293051490331141125_2_16389__/state" has
wrong magic 4522536 instead of 17112225
<35352016-06-07 10:16:59 GMT%LOG: startup process (PID 3538) was
terminated by signal 6: Abort trap
<35352016-06-07 10:16:59 GMT%LOG: aborting startup due to startup process
failure

Best Regards,
Nikhil

#2Nikhil
nikhilsmenon@gmail.com
In reply to: Nikhil (#1)
Re: 2 node bdr setup gives error in replication slots

I think it's caused by hard reboots (maybe the hypervisor itself was
rebooted!). Is there any setting that can reduce such problems?

On Tue, Jun 7, 2016 at 5:30 PM, Craig Ringer <craig@2ndquadrant.com> wrote:


On 7 June 2016 at 18:24, Nikhil <nikhilsmenon@gmail.com> wrote:

I am getting the error below in my 2-node BDR setup, and Postgres is going
down. Any idea?

<35382016-06-07 10:16:59 GMT%LOG: database system was interrupted; last
known up at 2016-06-07 09:06:44 GMT
<35382016-06-07 10:16:59 GMT%PANIC: replication slot file
"pg_replslot/bdr_16389_6293051490331141125_2_16389__/state" has
wrong magic 4522536 instead of 17112225
<35352016-06-07 10:16:59 GMT%LOG: startup process (PID 3538) was
terminated by signal 6: Abort trap
<35352016-06-07 10:16:59 GMT%LOG: aborting startup due to startup
process failure

That suggests that there was a write failure on the replication slot file.
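
To make the check behind that PANIC concrete, here is a minimal sketch of
reading the magic number from a slot state file. It assumes the magic is the
first 4-byte unsigned integer in the file, stored in the machine's native byte
order; the expected value 17112225 (0x1051CA1) is taken from the log message
above. The function names are hypothetical and this is not PostgreSQL's own
code.

```python
import struct

# Expected magic, as reported in the PANIC message (17112225 == 0x1051CA1).
SLOT_MAGIC = 17112225


def read_slot_magic(path):
    """Return the first 4 bytes of the file as a native-endian unsigned int.

    Assumption: the magic number is the first field of the state file.
    """
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        raise ValueError("file too short to contain a magic number")
    return struct.unpack("=I", header)[0]


def slot_magic_ok(path):
    """True if the file starts with the expected slot magic."""
    return read_slot_magic(path) == SLOT_MAGIC
```

Run it against a copy of the state file rather than the live data directory.
A mismatch only confirms what the server already reported; it does not say how
the file was damaged.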

A simple write error shouldn't be possible because we write the slot file
to a tempfile, then replace the old slot file with the new one. Filesystem
issues are possible, or memory corruption in the application that caused a
bad write. Or a bug, but it's hard to see how we could write the wrong slot
magic number here.
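
The write-to-tempfile-then-replace pattern described above can be sketched as
follows. This is a generic illustration of the technique on a POSIX
filesystem, not PostgreSQL's actual implementation; `atomic_write` and the
`.tmp` suffix are hypothetical names chosen for the example.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` (bytes) via a temp file.

    A reader should see either the complete old file or the complete new
    one, never a partially written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file contents to stable storage
        os.replace(tmp_path, path)  # atomically swap in the new file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself survives a crash (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The pattern only protects against torn writes if the storage stack honours
`fsync`. A hypervisor or disk cache that acknowledges writes before they are
durable can still leave a damaged file after a hard reboot.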

With the slot corrupted, all you can really do is part one of the nodes
from the BDR group and then join a new one.

If you're able to reproduce this I'd really like to see how it came about.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services