streaming replication - crash on standby

Started by Seong Son (US) | over 8 years ago | 5 messages | general
#1 Seong Son (US)
Seong.Son@datapath.com

The last line from pg_xlogdump of the last WAL file on the crashed standby server shows the following.

pg_xlogdump: FATAL: error in WAL record at DF/4CB95FD0: unexpected pageaddr DB/62B96000 in log segment 00000000000000DF0000004C, offset 12148736

I believe this means the standby server received a WAL file out of order? But why did it crash? Is crashing normal behavior in a case like this?

Thanks,
Seong

#2 Andres Freund
andres@anarazel.de
In reply to: Seong Son (US) (#1)
Re: streaming replication - crash on standby

Hi,

On 2017-08-09 22:03:43 +0000, Seong Son (US) wrote:

> The last line from pg_xlogdump of the last WAL file on the crashed standby server shows the following.
>
> pg_xlogdump: FATAL: error in WAL record at DF/4CB95FD0: unexpected pageaddr DB/62B96000 in log segment 00000000000000DF0000004C, offset 12148736
>
> I believe this means the standby server received a WAL file out of order? But why did it crash? Is crashing normal behavior in a case like this?

This likely just means that that's the end of the WAL.

- Andres

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
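
For readers who want to check the numbers in the pg_xlogdump message, the following Python sketch decodes them. It assumes the default 16 MB WAL segment size and 8 kB WAL page size, neither of which is stated in the thread, and the helper names are made up for illustration.

```python
# Sketch only: assumes default 16 MB WAL segments and 8 kB WAL pages.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
WAL_PAGE_SIZE = 8192


def parse_lsn(text):
    """Turn an LSN such as 'DF/4CB95FD0' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def format_lsn(lsn):
    """Inverse of parse_lsn."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)


def segment_and_offset(lsn):
    """Return (log/segment part of the WAL file name, byte offset in the segment)."""
    segments_per_log = 0x100000000 // WAL_SEGMENT_SIZE
    segno, offset = divmod(lsn, WAL_SEGMENT_SIZE)
    name = "%08X%08X" % divmod(segno, segments_per_log)
    return name, offset


record = parse_lsn("DF/4CB95FD0")  # last record pg_xlogdump could read
found = parse_lsn("DB/62B96000")   # page address it actually found

# Start of the WAL page that follows the last readable record.
next_page = (record // WAL_PAGE_SIZE + 1) * WAL_PAGE_SIZE

print(segment_and_offset(record))
print(format_lsn(next_page), segment_and_offset(next_page))
print(segment_and_offset(found))
```

Under those assumptions the next page starts at DF/4CB96000, which is offset 12148736 of segment ...DF0000004C, the offset named in the error. The address actually found, DB/62B96000, decodes to the same offset 12148736 in the much older segment ...DB00000062. That pattern is what a recycled segment file looks like past the point where new WAL has been written, which is consistent with the reading that this is simply the end of the WAL.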

#3 Seong Son (US)
Seong.Son@datapath.com
In reply to: Andres Freund (#2)
Re: streaming replication - crash on standby

I see. Thank you.

But the PostgreSQL process had crashed at that time, so streaming replication was no longer working. Why would it crash, and is that normal?

Thanks,

Seong

This email and any files transmitted with it are intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the system manager. This message contains information that is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.


#4 Andres Freund
andres@anarazel.de
In reply to: Seong Son (US) (#3)
Re: streaming replication - crash on standby

Hi,

Please quote properly on postgres mailing lists.

On 2017-08-09 22:31:23 +0000, Seong Son (US) wrote:

> I see. Thank you.
>
> But the PostgreSQL process had crashed at that time, so streaming replication was no longer working. Why would it crash, and is that normal?

You've given us absolutely zero information to be able to diagnose the
problem. If you want somebody to help you you'll have to describe
exactly what happened, and what the problem you're facing is.

- Andres

> This email and any files transmitted with it are intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the system manager. This message contains information that is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.

This footer makes no sense on a public list.


#5 Seong Son (US)
Seong.Son@datapath.com
In reply to: Andres Freund (#4)
Re: streaming replication - crash on standby


Sorry for the lack of info. I've gathered some more; hopefully it will be enough to help isolate the cause of the standby server's crash.

The servers run Windows Server 2012 R2 and PostgreSQL 9.6. The primary and standby servers are in two different cities, connected over a VPN.

Here are the last few lines from pg_log at the time of the standby server's crash:

2017-08-08 21:17:56 UTC FATAL: invalid memory alloc request size 1656315904
2017-08-08 21:17:56 UTC LOG: startup process (PID 2972) exited with exit code 1
2017-08-08 21:17:56 UTC LOG: terminating any other active server processes
2017-08-08 21:17:56 UTC WARNING: terminating connection because of crash of another server process
2017-08-08 21:17:56 UTC DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-08-08 21:17:56 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-08-08 21:17:56 UTC WARNING: terminating connection because of crash of another server process
2017-08-08 21:17:56 UTC DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-08-08 21:17:56 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-08-08 21:17:56 UTC LOG: database system is shut down

And this is the last entry from pg_xlogdump:

-08 21:17:36.864852 Coordinated Universal Time
pg_xlogdump: FATAL: error in WAL record at DF/4CB95FD0: unexpected pageaddr DB/62B96000 in log segment 00000000000000DF0000004C, offset 12148736
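
A quick arithmetic check on the two numbers above, offered as an observation rather than a diagnosis. The sketch compares the allocation size from the FATAL line with PostgreSQL's ordinary allocation limit (MaxAllocSize, 0x3fffffff, i.e. 1 GB minus one byte) and with the low 32 bits of the unexpected page address. The variable names are illustrative only.

```python
MAX_ALLOC_SIZE = 0x3FFFFFFF  # PostgreSQL's MaxAllocSize: 1 GB - 1

requested = 1656315904               # size in the FATAL line from pg_log
page_addr_low = int("62B96000", 16)  # low half of pageaddr DB/62B96000

print(hex(requested))
print(requested > MAX_ALLOC_SIZE)    # True means the request is over the limit
print(requested == page_addr_low)    # True means the two values coincide
```

The requested size is above the limit, which is why the allocation was rejected, and its value is exactly 0x62B96000, the low half of the stale page address. This hints, but does not prove, that bytes left over from an older segment were read as a length field.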

One thing I noticed is that the network is not the most stable. When I ran a Wireshark capture on port 5432, I saw numerous errors and warnings such as:
"New fragment overlaps old data (retransmission?)"
"This frame is a (suspected) out-of-order segment"
"This frame is a (suspected) retransmission"

So the questions are: why did the standby server crash? Could the network instability have caused it?

Thank you in advance for any info.
Seong
