Streaming Replication Server Crash

Started by raghu ramover 13 years ago8 messagesgeneral

raghuchennuru@gmail.com

over 13 years ago

Hi All,

We have configured Streaming Replication b/w Primary and Standby server and
Pgpool-II load balancing module diverting SELECT statements to Standby
server. As per our observations, Standby server crashed during peak hours
on today and error message as follows:

2012-10-19 12:26:43 IST [11934]: [1-1] user=postgres,db=hmis LOG:
duration: 20345.702 ms execute <unnamed>: SELECT DISTINCT
y.ais_insm_name,y.ais_insm_type,ha_dism_name,ha_hudm_name,x2.ais_equip_receipt_details_value_lov,
x3.ais_equip_receipt_details_value,x4.ais_equip_receipt_details_value,
mast_frm_freq, t_report_code, t_report_date, t_report_month,
t_report_year,t_report_createdate,t_report_date2, t_report_month2,
t_report_year2,mast_frm_validity_days FROM TNHSPTOOL_FORM_MASTER,
TRAN_712_MASTER full join ais_institution_master as y on(
t_instiution_code = y.ais_insm_code) full join ha_district_master on(
t_district_code=ha_dism_code ) full join ha_hud_master on(
t_hud_code=ha_hudm_code )full join tran_712_664 x2 on
(x2.t_rep_code=t_report_code and x2.t_rep_rowno=3) full join tran_712_664
x3 on (x3.t_rep_code=t_report_code and x3.t_rep_rowno=5) full join
tran_712_664 x4 on (x4.t_rep_code=t_report_code and x4.t_rep_rowno=36)
WHERE mast_frm_code = T_REPORT_FRMID AND y.ais_insm_code ='00231' and
T_REPORT_FRMID =712 and T_REPORT_OLDFLAG='O' ORDER BY T_REPORT_CODE DESC

2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG: server process (PID
15565) was terminated by signal 10

2012-10-19 12:26:46 IST [1338]: [19-1] user=,db= LOG: terminating any
other active server processes

2012-10-19 12:26:46 IST [18450]: [3-1] user=postgres,db=DBHMS WARNING:
terminating connection because of crash of another server process

Standby Server configuration Details as follows:

max_wal_senders = 5
wal_sender_delay = 200ms
wal_keep_segments = 128
vacuum_defer_cleanup_age = 0
hot_standby = on
max_standby_archive_delay = -1
max_standby_streaming_delay = -1

PostgreSQL version: 9.0.4
OS: Solaris 64bit

Could you please share your inputs,to fix this issue.

Thanks & Regards,

Raghu Ram

Craig Ringer

craig@2ndquadrant.com

over 13 years ago

In reply to: raghu ram (#1)

Re: Streaming Replication Server Crash

On 10/19/2012 04:40 PM, raghu ram wrote:

Hi All,

We have configured Streaming Replication b/w Primary and Standby server
and Pgpool-II load balancing module diverting
SELECT statements to Standby server. As per our observations, Standby
server crashed during peak hours on today and error message as follows

2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG: server process
(PID 15565) was terminated by signal 10

2012-10-19 12:26:46 IST [1338]: [19-1] user=,db= LOG: terminating any
other active server processes

That's odd. SIGUSR1 (signal 10) shouldn't terminate PostgreSQL.

Was the server intentionally sent SIGUSR1 by an admin? Do you know what
triggered the signal?

Are you running any procedural languages other than PL/PgSQL, or any
custom C extensions? Anything that might have unwittingly cleared the
signal handler for SIGUSR1?

--
Craig Ringer

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Craig Ringer (#2)

Re: [ADMIN] Streaming Replication Server Crash

Craig Ringer <ringerc@ringerc.id.au> writes:

On 10/19/2012 04:40 PM, raghu ram wrote:

2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG: server process
(PID 15565) was terminated by signal 10

That's odd. SIGUSR1 (signal 10) shouldn't terminate PostgreSQL.

Was the server intentionally sent SIGUSR1 by an admin? Do you know what
triggered the signal?

SIGUSR1 is used for all sorts of internal cross-process signaling
purposes. There's no need to hypothesize any external force sending
it; if somebody had broken a PG process's signal handling setup for
SIGUSR1, a crash of this sort could be expected in short order.

But having said that, are we sure 10 is SIGUSR1 on the OP's platform?
AFAIK, that signal number is not at all compatible across different
flavors of Unix. (I see SIGUSR1 is 30 on OS X for instance.)

Are you running any procedural languages other than PL/PgSQL, or any
custom C extensions? Anything that might have unwittingly cleared the
signal handler for SIGUSR1?

libperl has a bad habit of thinking it can mess with the process's
signal setup ...

regards, tom lane

Craig Ringer

craig@2ndquadrant.com

over 13 years ago

In reply to: Tom Lane (#3)

Re: [ADMIN] Streaming Replication Server Crash

On 10/22/2012 08:52 PM, Tom Lane wrote:

Craig Ringer <ringerc@ringerc.id.au> writes:

On 10/19/2012 04:40 PM, raghu ram wrote:

2012-10-19 12:26:46 IST [1338]: [18-1] user=,db= LOG: server process
(PID 15565) was terminated by signal 10

That's odd. SIGUSR1 (signal 10) shouldn't terminate PostgreSQL.

Was the server intentionally sent SIGUSR1 by an admin? Do you know what
triggered the signal?

SIGUSR1 is used for all sorts of internal cross-process signaling
purposes. There's no need to hypothesize any external force sending
it; if somebody had broken a PG process's signal handling setup for
SIGUSR1, a crash of this sort could be expected in short order.

But having said that, are we sure 10 is SIGUSR1 on the OP's platform?
AFAIK, that signal number is not at all compatible across different
flavors of Unix. (I see SIGUSR1 is 30 on OS X for instance.)

Gah. I incorrectly though that POSIX specified signal *numbers*, not
just names. That does not appear to actually be the case. Thanks.

A bit of searching suggests that on Solaris/SunOS, signal 10 is SIGBUS:

http://www.s-gms.ms.edus.si/cgi-bin/man-cgi?signal+3HEAD
http://docs.oracle.com/cd/E23824_01/html/821-1464/signal-3head.html

... which tends to suggest an entirely different interpretation than
"someone broke a signal hander":

https://blogs.oracle.com/peteh/entry/sigbus_versus_sigsegv_according_to

such as:

- Bad mmap()ed read
- alignment error
- hardware fault

so it's not immensely different to a segfault in that it can be caused
by errors in hardware, OS, or applications.

Raghu, did PostgreSQL dump a core file? If it didn't, you might want to
enable core dumps in future. If it did dump a core, attaching a debugger
to the core file might tell you where it crashed, possibly offering some
more information to diagnose the issue. I'm not familiar enough with
Solaris to offer detailed advice on that, especially as you haven't
mentioned your Solaris version, how you installed Pg, etc. This may be
of some use:

http://stackoverflow.com/questions/6403803/how-to-get-backtrace-function-line-number-on-solaris

--
Craig Ringer

Craig Ringer

craig@2ndquadrant.com

over 13 years ago

In reply to: Craig Ringer (#4)

Re: Streaming Replication Server Crash

On 10/23/2012 01:03 PM, Craig Ringer wrote:

http://stackoverflow.com/questions/6403803/how-to-get-backtrace-function-line-number-on-solaris

Actually, that link doesn't apply to this problem, it's for getting a
stack trace programmatically:

Try:

http://publib.boulder.ibm.com/httpserv/ihsdiag/get_backtrace.html

http://www.princeton.edu/~unix/Solaris/troubleshoot/adb.html
<http://www.princeton.edu/%7Eunix/Solaris/troubleshoot/adb.html>

Most of the good links I could find were on blogs.sun.com, which Oracle
have helpfully redirected to www.oracle.com - where the pages don't
actually exist.

--
Craig Ringer

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Craig Ringer (#4)

Re: [ADMIN] Streaming Replication Server Crash

Craig Ringer <ringerc@ringerc.id.au> writes:

On 10/22/2012 08:52 PM, Tom Lane wrote:

But having said that, are we sure 10 is SIGUSR1 on the OP's platform?
AFAIK, that signal number is not at all compatible across different
flavors of Unix. (I see SIGUSR1 is 30 on OS X for instance.)

Gah. I incorrectly though that POSIX specified signal *numbers*, not
just names. That does not appear to actually be the case. Thanks.

This isn't the first time I've wondered exactly which signal was meant
in a postmaster child-crash report. Seems like it might be worth
expending some code on a symbolic translation, instead of just printing
the number. That'd be easy enough (for common signal names) on Unix,
but has anyone got a suggestion how we might do something useful on
Windows?

regards, tom lane

Craig Ringer

craig@2ndquadrant.com

over 13 years ago

In reply to: Tom Lane (#6)

Re: [ADMIN] Streaming Replication Server Crash

On 10/23/2012 01:20 PM, Tom Lane wrote:

This isn't the first time I've wondered exactly which signal was meant
in a postmaster child-crash report. Seems like it might be worth
expending some code on a symbolic translation, instead of just printing
the number. That'd be easy enough (for common signal names) on Unix,
but has anyone got a suggestion how we might do something useful on
Windows?

Here's a typical Windows exception:

2012-10-04 14:29:08 CEST LOG: server process (PID 1416) was terminated
by exception 0xC0000005

2012-10-04 14:29:08 CEST HINT: See C include file "ntstatus.h" for a
description of the hexadecimal value.

These codes can be translated with FormatMessage:

http://msdn.microsoft.com/en-us/library/windows/desktop/ms679351(v=vs.85).aspx
<http://msdn.microsoft.com/en-us/library/windows/desktop/ms679351%28v=vs.85%29.aspx>
http://support.microsoft.com/kb/259693

FormatMessage may not be safe to perform in the context of a munged heap
or some other failure conditions, so you probably don't want to do it
from a crash handler. It is safe for the postmaster to do it based on
the exception code it gets from the dying backend, though.

I'd say the best option is for the postmaster to print the
FormatMessage(
FORMAT_MESSAGE_ALLOCATE_BUFFER|FORMAT_MESSAGE_FROM_SYSTEM|FORMAT_MESSAGE_FROM_HMODULE,
...) output when it sees the exception code from the dying backend.

RtlNtStatusToDosError may also be of interest:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms680600(v=vs.85).aspx
<http://msdn.microsoft.com/en-us/library/windows/desktop/ms680600%28v=vs.85%29.aspx>
... but it's in Winternl.h so it's not guaranteed to exist / be
compatible between versions and can only be accessed via runtime dynamic
linking. Not ideal.

--
Craig Ringer

Myers Brian D

brian.d.myers@state.or.us

over 13 years ago

In reply to: Tom Lane (#6)

Re: [GENERAL] Streaming Replication Server Crash

It looks like there's no standard way to do that. Here's how I'd do it in Python:

[CODE]
import signal
dict((k, v) for v, k in signal.__dict__.iteritems() if v.startswith('SIG'))
[/CODE]

In C, I guess I'd just do a switch statement on the common signal names between Windows and POSIX as exposed SIGNAL.H. Looks like all you get in Windows is:

http://msdn.microsoft.com/en-us/library/xdkz3x12(v=vs.110).aspx

Brian

-----Original Message-----
From: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Monday, October 22, 2012 10:21 PM
To: Craig Ringer
Cc: raghu ram; pgsql-admin@postgresql.org; pgsql-general
Subject: Re: [ADMIN] [GENERAL] Streaming Replication Server Crash

Craig Ringer <ringerc@ringerc.id.au> writes:

On 10/22/2012 08:52 PM, Tom Lane wrote:

But having said that, are we sure 10 is SIGUSR1 on the OP's platform?
AFAIK, that signal number is not at all compatible across different
flavors of Unix. (I see SIGUSR1 is 30 on OS X for instance.)

Gah. I incorrectly though that POSIX specified signal *numbers*, not
just names. That does not appear to actually be the case. Thanks.

This isn't the first time I've wondered exactly which signal was meant in a postmaster child-crash report. Seems like it might be worth expending some code on a symbolic translation, instead of just printing the number. That'd be easy enough (for common signal names) on Unix, but has anyone got a suggestion how we might do something useful on Windows?

regards, tom lane

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin