gcc 4.6 and hot standby
So I've been delaying moving some production boxes over to 9.0.4 from
9.0.2 because hot standby fails with:
(this is on the "hot standby" machine that connects to the master)
2011-06-08 11:40:48 MDT [6072]: [2-1] user= LOG: entering standby mode
2011-06-08 11:40:48 MDT [6072]: [3-1] user= DEBUG: checkpoint record
is at 86/E5D725F0
2011-06-08 11:40:48 MDT [6072]: [4-1] user= DEBUG: redo record is at
86/E39E8248; shutdown FALSE
2011-06-08 11:40:48 MDT [6072]: [5-1] user= DEBUG: next transaction
ID: 0/35456371; next OID: 34090526
2011-06-08 11:40:48 MDT [6072]: [6-1] user= DEBUG: next MultiXactId:
523; next MultiXactOffset: 1046
2011-06-08 11:40:48 MDT [6072]: [7-1] user= DEBUG: oldest unfrozen
transaction ID: 654, in database 1
2011-06-08 11:40:48 MDT [6072]: [8-1] user= DEBUG: transaction ID
wrap limit is 2147484301, limited by database with OID 1
2011-06-08 11:40:48 MDT [6072]: [9-1] user= DEBUG: initializing for hot standby
2011-06-08 11:40:48 MDT [6072]: [10-1] user= LOG: redo starts at 86/E39E8248
2011-06-08 11:40:48 MDT [6072]: [11-1] user= LOG: invalid record
length at 86/E39F2010
2011-06-08 11:40:48 MDT [6074]: [1-1] user= LOG: streaming
replication successfully connected to primary
2011-06-08 11:40:49 MDT [6072]: [12-1] user= LOG: invalid record
length at 86/E3A16010
2011-06-08 11:40:49 MDT [6074]: [2-1] user= FATAL: terminating
walreceiver process due to administrator command
2011-06-08 11:40:49 MDT [6072]: [13-1] user= LOG: invalid record
length at 86/E3A3C010
2011-06-08 11:40:53 MDT [6072]: [14-1] user= LOG: invalid record
length at 86/E3A54010
2011-06-08 11:40:53 MDT [6075]: [1-1] user= FATAL: terminating
walreceiver process due to administrator command
2011-06-08 11:40:53 MDT [6072]: [15-1] user= LOG: invalid record
length at 86/E3A74010
2011-06-08 11:40:58 MDT [6076]: [1-1] user= LOG: streaming
replication successfully connected to primary
2011-06-08 11:40:59 MDT [6072]: [16-1] user= LOG: invalid record
length at 86/E3AC6010
2011-06-08 11:40:59 MDT [6076]: [2-1] user= FATAL: terminating
walreceiver process due to administrator command
2011-06-08 11:40:59 MDT [6072]: [17-1] user= LOG: invalid record
length at 86/E3ACC010
2011-06-08 11:41:03 MDT [6072]: [18-1] user= LOG: invalid record
length at 86/E3B32010
2011-06-08 11:41:03 MDT [6078]: [1-1] user= FATAL: terminating
walreceiver process due to administrator command
[ repeats... ]
Originally I thought there might be some corner case bug in 9.0.3 or
9.0.4. However after recompiling 9.0.2 with gcc 4.6 and hitting the
same problem-- I tried compiling 9.0.4 with gcc 4.5 and it seemed to
work great. I then tired various optimization levels on 4.6:
-O0: works
-O1: works
-O2: fails
-Os: works
I suppose the next step is to narrow it down to a specific flag -O2
uses... But I thought I would post here first-- maybe someone else has
hit this? Or maybe someone has a bright idea on how to narrow this
down?
# linux 2.6.39.1 x86_64 AMD opteron box
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: /build/src/gcc-4.6-20110603/configure --prefix=/usr
--libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man
--infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/
--enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++
--enable-shared --enable-threads=posix --with-system-zlib
--enable-__cxa_atexit --disable-libunwind-exceptions
--enable-clocale=gnu --enable-gnu-unique-object
--enable-linker-build-id --with-ppl --enable-cloog-backend=isl
--enable-lto --enable-gold --enable-ld=default --enable-plugin
--with-plugin-ld=ld.gold --disable-multilib --disable-libstdcxx-pch
--enable-checking=release
Thread model: posix
gcc version 4.6.0 20110603 (prerelease) (GCC)
Alex Hunsaker <badalex@gmail.com> writes:
So I've been delaying moving some production boxes over to 9.0.4 from
2011-06-08 11:41:03 MDT [6078]: [1-1] user= FATAL: terminating
walreceiver process due to administrator command
[ repeats... ]
I suppose the next step is to narrow it down to a specific flag -O2
uses... But I thought I would post here first-- maybe someone else has
hit this? Or maybe someone has a bright idea on how to narrow this
down?
Maybe using a "prerelease" gcc version isn't such a hot idea for
production. It's very, very, very difficult to see how the behavior you
describe isn't a compiler bug. (Well, I could also believe that
something external is repeatedly hitting the walreceiver with a SIGTERM,
but it's hard to square that with the behavior changing when you
recompile with different -O levels ...)
It might be useful to strace the postmaster and walreceiver processes
just to see if any signal is actually being sent or received.
regards, tom lane
On Wed, Jun 8, 2011 at 12:12, Alex Hunsaker <badalex@gmail.com> wrote:
So I've been delaying moving some production boxes over to 9.0.4 from
9.0.2 because hot standby fails with:
(this is on the "hot standby" machine that connects to the master)
[ ...]
2011-06-08 11:41:03 MDT [6072]: [18-1] user= LOG: invalid record
length at 86/E3B32010
2011-06-08 11:41:03 MDT [6078]: [1-1] user= FATAL: terminating
walreceiver process due to administrator command
[ repeats... ]
[...] I then tired various optimization levels on 4.6:
-O0: works
-O1: works
-O2: fails
-Os: works
So I tracked it down to "-fgcse", that is CFLAGS="-O2 -fno-gcse"
./configure works. I then took a few guesses and compiled all of
postgres with -O2, then manually recompiled xlog.c with -f-no-gcse.
that combination seems to work.
[ One thing im not sure is why -Os works, I tried -O2 and added all
the -fno-XXX options it says -Os adds. I suppose its either they turn
off/on other optimizations the man page does not mention, or I guess
thats compiler bugs for ya ]
On Wed, Jun 8, 2011 at 12:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alex Hunsaker <badalex@gmail.com> writes:
So I've been delaying moving some production boxes over to 9.0.4 from
2011-06-08 11:41:03 MDT [6078]: [1-1] user= FATAL: terminating
walreceiver process due to administrator command
[ repeats... ]I suppose the next step is to narrow it down to a specific flag -O2
uses... But I thought I would post here first-- maybe someone else has
hit this? Or maybe someone has a bright idea on how to narrow this
down?Maybe using a "prerelease" gcc version isn't such a hot idea for
production. It's very, very, very difficult to see how the behavior you
describe isn't a compiler bug.
Yeah :-). However ill note it looks like its the default compiler for
fedora 15, ubuntu natty and debian sid.
It might be useful to strace the postmaster and walreceiver processes
just to see if any signal is actually being sent or received.
Will do.
Alex Hunsaker <badalex@gmail.com> writes:
So I tracked it down to "-fgcse", that is CFLAGS="-O2 -fno-gcse"
./configure works. I then took a few guesses and compiled all of
postgres with -O2, then manually recompiled xlog.c with -f-no-gcse.
that combination seems to work.
Huh, interesting. So the bug must be lurking somewhere around the logic
that deals with failedSources: somehow we're getting to the
ShutdownWalRcv call in XLogPageRead.
regards, tom lane
On Wed, Jun 8, 2011 at 12:58, Alex Hunsaker <badalex@gmail.com> wrote:
On Wed, Jun 8, 2011 at 12:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It might be useful to strace the postmaster and walreceiver processes
just to see if any signal is actually being sent or received.Will do.
Find it attached.
Corresponding logs:
2011-06-08 14:01:21 MDT [27779]: [1-1] user= LOG: could not create
IPv6 socket: Address family not supported by protocol
2011-06-08 14:01:21 MDT [27781]: [1-1] user= LOG: database system was
shut down in recovery at 2011-06-08 14:00:37 MDT
2011-06-08 14:01:21 MDT [27781]: [2-1] user= LOG: entering standby mode
2011-06-08 14:01:21 MDT [27781]: [3-1] user= DEBUG: checkpoint record
is at 86/F6E76708
2011-06-08 14:01:21 MDT [27781]: [4-1] user= DEBUG: redo record is at
86/F4DF9BF0; shutdown FALSE
2011-06-08 14:01:21 MDT [27781]: [5-1] user= DEBUG: next transaction
ID: 0/35484525; next OID: 34123294
2011-06-08 14:01:21 MDT [27781]: [6-1] user= DEBUG: next MultiXactId:
523; next MultiXactOffset: 1046
2011-06-08 14:01:21 MDT [27781]: [7-1] user= DEBUG: oldest unfrozen
transaction ID: 654, in database 1
2011-06-08 14:01:21 MDT [27781]: [8-1] user= DEBUG: transaction ID
wrap limit is 2147484301, limited by database with OID 1
2011-06-08 14:01:21 MDT [27781]: [9-1] user= DEBUG: initializing for
hot standby
2011-06-08 14:01:21 MDT [27781]: [10-1] user= LOG: redo starts at 86/F4DF9BF0
2011-06-08 14:01:21 MDT [27781]: [11-1] user= LOG: invalid record
length at 86/F4E62010
2011-06-08 14:01:21 MDT [27783]: [1-1] user= LOG: streaming
replication successfully connected to primary
2011-06-08 14:01:22 MDT [27781]: [12-1] user= LOG: invalid record
length at 86/F4E82010
2011-06-08 14:01:24 MDT [27783]: [2-1] user= FATAL: terminating
walreceiver process due to administrator command
2011-06-08 14:01:24 MDT [27781]: [13-1] user= LOG: invalid record
length at 86/F4EEC010
2011-06-08 14:01:26 MDT [27784]: [1-1] user= LOG: streaming
replication successfully connected to primary
2011-06-08 14:01:26 MDT [27781]: [14-1] user= LOG: invalid record
length at 86/F4F1E010
2011-06-08 14:01:26 MDT [27784]: [2-1] user= FATAL: terminating
walreceiver process due to administrator command
2011-06-08 14:01:26 MDT [27781]: [15-1] user= LOG: invalid record
length at 86/F4F22010
2011-06-08 14:01:31 MDT [27785]: [1-1] user= LOG: streaming
replication successfully connected to primary
^C2011-06-08 14:01:32 MDT [27779]: [2-1] user= LOG: received fast
shutdown request
2011-06-08 14:01:32 MDT [27785]: [2-1] user= FATAL: terminating
walreceiver process due to administrator command
2011-06-08 14:01:32 MDT [27782]: [1-1] user= LOG: shutting down
2011-06-08 14:01:32 MDT [27782]: [2-1] user= LOG: database system is shut down
Attachments:
strace.tar.bz2application/x-bzip2; name=strace.tar.bz2Download
BZh91AY&SY�z��������������������� aH>>�
���K�uW�� � ,y� p�a� ��4m` �6G�A� +��� � �
�p[>