BUG #3362: xlog corruption just after initdb on irix
The following bug has been logged online:
Bug reference: 3362
Logged by: Herve Boulouis
Email address: amon@sockar.homeip.net
PostgreSQL version: 8.2.4
Operating system: IRIX 6.5.26
Description: xlog corruption just after initdb on irix
Details:
what I did:
- compile postgres 8.2.4 (MIPSPro 7.4.4m compiler)
export CC=c99
export CXX=CC
./configure --enable-integer-datetimes --with-perl --with-python
- compile and install, create pgsql user, ...
- initialize db
su - pgsql
export PGLIB=/usr/local/pgsql/lib
export PGDATA=$HOME/data
initdb
- trigger the bug
/etc/init.d/pgsql start
/etc/init.d/pgsql stop
/etc/init.d/pgsql start
/etc/init.d/pgsql stop
/etc/init.d/pgsql start
<signal 6 on postmaster startup, see logfile>
from that point, postmaster won't start unless another initdb is done.
- command used by the script to start/stop
su - pgsql -c "exec /usr/local/pgsql/bin/pg_ctl start -w -D
'/usr/local/pgsql/data'"
su - pgsql -c "/usr/local/pgsql/bin/pg_ctl stop -D '/usr/local/pgsql/data'
-s -m smart"
- logfile of the 5 start/stop operations :
* first start
Jun 5 15:22:59 6T:origin300 postgres[2007]: [1-1] LOG: could not create
IPv6 socket: Address family not supported by protocol family
Jun 5 15:22:59 6T:origin300 postgres[2007]: [2-1] LOG: could not create
socket for statistics collector: Address family not supported by protocol
family
Jun 5 15:22:59 6T:origin300 postgres[2007]: [3-1] LOG: trying another
address for the statistics collector
Jun 5 15:22:59 6T:origin300 postgres[2008]: [4-1] LOG: database system was
shut down at 2007-06-05 15:21:59 GMT
Jun 5 15:22:59 6T:origin300 postgres[2008]: [5-1] LOG: checkpoint record
is at 0/4866C8
Jun 5 15:22:59 6T:origin300 postgres[2008]: [6-1] LOG: redo record is at
0/4866C8; undo record is at 0/0; shutdown TRUE
Jun 5 15:22:59 6T:origin300 postgres[2008]: [7-1] LOG: next transaction
ID: 0/593; next OID: 10820
Jun 5 15:22:59 6T:origin300 postgres[2008]: [8-1] LOG: next MultiXactId:
1; next MultiXactOffset: 0
Jun 5 15:22:59 6T:origin300 postgres[2008]: [9-1] LOG: database system is
ready
* first stop
Jun 5 15:23:11 6T:origin300 postgres[2007]: [4-1] LOG: received smart
shutdown request
Jun 5 15:23:11 6T:origin300 postgres[2009]: [4-1] LOG: shutting down
Jun 5 15:23:11 6T:origin300 postgres[2009]: [5-1] LOG: database system is
shut down
* 2d start
Jun 5 15:23:17 6T:origin300 postgres[2027]: [1-1] LOG: could not create
IPv6 socket: Address family not supported by protocol family
Jun 5 15:23:17 6T:origin300 postgres[2027]: [2-1] LOG: could not create
socket for statistics collector: Address family not supported by protocol
family
Jun 5 15:23:17 6T:origin300 postgres[2027]: [3-1] LOG: trying another
address for the statistics collector
Jun 5 15:23:17 6T:origin300 postgres[2028]: [4-1] LOG: database system was
shut down at 2007-06-05 15:23:11 GMT
Jun 5 15:23:17 6T:origin300 postgres[2028]: [5-1] LOG: record with zero
length at 0/486718
* vvv I believe this line is not normal, as there was a clean shutdown
Jun 5 15:23:17 6T:origin300 postgres[2028]: [6-1] LOG: invalid primary
checkpoint record
Jun 5 15:23:17 6T:origin300 postgres[2028]: [7-1] LOG: using previous
checkpoint record at 0/4866C8
Jun 5 15:23:17 6T:origin300 postgres[2028]: [8-1] LOG: redo record is at
0/4866C8; undo record is at 0/0; shutdown TRUE
Jun 5 15:23:17 6T:origin300 postgres[2028]: [9-1] LOG: next transaction
ID: 0/593; next OID: 10820
Jun 5 15:23:17 6T:origin300 postgres[2028]: [10-1] LOG: next MultiXactId:
1; next MultiXactOffset: 0
Jun 5 15:23:17 6T:origin300 postgres[2028]: [11-1] LOG: database system
was not properly shut down; automatic recovery in progress
Jun 5 15:23:17 6T:origin300 postgres[2028]: [12-1] LOG: record with zero
length at 0/486718
Jun 5 15:23:17 6T:origin300 postgres[2028]: [13-1] LOG: redo is not
required
Jun 5 15:23:17 6T:origin300 postgres[2028]: [14-1] LOG: database system is
ready
* 2d stop
Jun 5 15:23:22 6T:origin300 postgres[2027]: [4-1] LOG: received smart
shutdown request
Jun 5 15:23:22 6T:origin300 postgres[2029]: [4-1] LOG: shutting down
Jun 5 15:23:22 6T:origin300 postgres[2029]: [5-1] LOG: database system is
shut down
* 3d start
Jun 5 15:23:25 6T:origin300 postgres[2047]: [1-1] LOG: could not create
IPv6 socket: Address family not supported by protocol family
Jun 5 15:23:25 6T:origin300 postgres[2047]: [2-1] LOG: could not create
socket for statistics collector: Address family not supported by protocol
family
Jun 5 15:23:25 6T:origin300 postgres[2047]: [3-1] LOG: trying another
address for the statistics collector
Jun 5 15:23:25 6T:origin300 postgres[2048]: [4-1] LOG: database system was
shut down at 2007-06-05 15:23:22 GMT
Jun 5 15:23:25 6T:origin300 postgres[2048]: [5-1] LOG: record with zero
length at 0/486768
Jun 5 15:23:25 6T:origin300 postgres[2048]: [6-1] LOG: invalid primary
checkpoint record
Jun 5 15:23:25 6T:origin300 postgres[2048]: [7-1] LOG: record with zero
length at 0/486718
Jun 5 15:23:25 6T:origin300 postgres[2048]: [8-1] LOG: invalid secondary
checkpoint record
* vvv this it gets worse
Jun 5 15:23:25 2T:origin300 postgres[2048]: [9-1] PANIC: could not locate
a valid checkpoint record
Jun 5 15:23:26 6T:origin300 postgres[2047]: [4-1] LOG: startup process
(PID 2048) was terminated by signal 6
Jun 5 15:23:26 6T:origin300 postgres[2047]: [5-1] LOG: aborting startup
due to startup process failure
I had already noticed that problem on this platform with 8.0 or 8.1, but
didn't report it at the time.
I kept the core file and here is the backtrace :
origin300:/usr/local/pgsql/data# dbx /usr/local/pgsql/bin/postgres core
dbx version 7.3.4 (86441_Nov11 MR) Nov 11 2002 11:31:55
Core from signal SIGABRT: Abort (see abort(3c))
(dbx) where
0 _kill(0x800, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/xlv41/6.5.26m/work/irix/lib/libc/libc_n32_M4/signal/kill.s":15,
0xfa54258]
1 _raise(0x800, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/xlv41/6.5.26m/work/irix/lib/libc/libc_n32_M4/signal/raise.c":27,
0xfad16dc]
2 abort(0x800, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/xlv41/6.5.26m/work/irix/lib/libc/libc_n32_M4/gen/abort.c":52, 0xfa6ef60]
3 errfinish(0x0, 0x0, 0x3, 0x5, 0x40000, 0x0, 0x68a1, 0x0)
["/opt/install/postgresql-8.2.4/src/backend/utils/error/elog.c":451,
0x102d9678]
4 StartupXLOG(0x800, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/opt/install/postgresql-8.2.4/src/backend/access/transam/xlog.c":4776,
0x10088638]
5 BootstrapMain(0x103b7524, 0x2, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/opt/install/postgresql-8.2.4/src/backend/bootstrap/bootstrap.c":423,
0x100a3cd4]
6 StartChildProcess(0x2, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/opt/install/postgresql-8.2.4/src/backend/postmaster/postmaster.c":3691,
0x101e6de8]
7 PostmasterMain(0x800, 0xfa49bf0, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1,
0x0)
["/opt/install/postgresql-8.2.4/src/backend/postmaster/postmaster.c":961,
0x101e267c]
8 main(0x3, 0x6, 0x86bf8, 0x0, 0x40000, 0x0, 0x68a1, 0x0)
["/opt/install/postgresql-8.2.4/src/backend/main/main.c":188, 0x1017e684]
9 __start()
["/xlv55/kudzu-apr12/work/irix/lib/libc/libc_n32_M4/csu/crt1text.s":177,
0x1003f2a8]
If necessary, I can provide access to an irix machine if some pgsql
developpers need it to test/maintain the irix port.
"Herve Boulouis" <amon@sockar.homeip.net> writes:
- trigger the bug
/etc/init.d/pgsql start
/etc/init.d/pgsql stop
/etc/init.d/pgsql start
/etc/init.d/pgsql stop
/etc/init.d/pgsql start
<signal 6 on postmaster startup, see logfile>
from that point, postmaster won't start unless another initdb is done.
That's pretty dang weird. It looks like it's failing to write the
shutdown checkpoint record (but is updating pg_control anyway). But if
that's the explanation then it should have failed during initdb, because
each step of initdb goes through the start/stop process.
I had already noticed that problem on this platform with 8.0 or 8.1, but
didn't report it at the time.
You should have mentioned it earlier :-( ... whatever the problem is
is surely now long-forgotten. Although I see we have a report of 8.1
working with IRIX 6.5, so apparently it works for some people. Are you
sure your compiler is up-to-date?
If necessary, I can provide access to an irix machine if some pgsql
developpers need it to test/maintain the irix port.
I'm willing to take a look, but first please see if you can eliminate
the compiler-bug theory. Check to see if there's a newer compiler
version available; check whether building with CFLAGS=-O0 makes the
problem go away.
regards, tom lane
On Tue, 2007-06-05 at 14:42 +0000, Herve Boulouis wrote:
PostgreSQL version: 8.2.4
Operating system: IRIX 6.5.26
Thats not yet a supported system, but we should fix this.
Jun 5 15:23:11 6T:origin300 postgres[2009]: [5-1] LOG: database system is
shut down
* 2d start
Jun 5 15:23:17 6T:origin300 postgres[2027]: [1-1] LOG: could not create
IPv6 socket: Address family not supported by protocol family
Jun 5 15:23:17 6T:origin300 postgres[2027]: [2-1] LOG: could not create
socket for statistics collector: Address family not supported by protocol
family
Jun 5 15:23:17 6T:origin300 postgres[2027]: [3-1] LOG: trying another
address for the statistics collector
Jun 5 15:23:17 6T:origin300 postgres[2028]: [4-1] LOG: database system was
shut down at 2007-06-05 15:23:11 GMT
Jun 5 15:23:17 6T:origin300 postgres[2028]: [5-1] LOG: record with zero
length at 0/486718
* vvv I believe this line is not normal, as there was a clean shutdown
Jun 5 15:23:17 6T:origin300 postgres[2028]: [6-1] LOG: invalid primary
checkpoint record
Jun 5 15:23:17 6T:origin300 postgres[2028]: [7-1] LOG: using previous
checkpoint record at 0/4866C8
Jun 5 15:23:17 6T:origin300 postgres[2028]: [8-1] LOG: redo record is at
0/4866C8; undo record is at 0/0; shutdown TRUE
Jun 5 15:23:17 6T:origin300 postgres[2028]: [9-1] LOG: next transaction
ID: 0/593; next OID: 10820
Jun 5 15:23:17 6T:origin300 postgres[2028]: [10-1] LOG: next MultiXactId:
1; next MultiXactOffset: 0
Jun 5 15:23:17 6T:origin300 postgres[2028]: [11-1] LOG: database system
was not properly shut down; automatic recovery in progress
Jun 5 15:23:17 6T:origin300 postgres[2028]: [12-1] LOG: record with zero
length at 0/486718
Jun 5 15:23:17 6T:origin300 postgres[2028]: [13-1] LOG: redo is not
required
Jun 5 15:23:17 6T:origin300 postgres[2028]: [14-1] LOG: database system is
ready
Well, I'd question what's going on with the hardware/filesystem or the
settings you've chosen for fsync and wal_fsync.
If you shut it down cleanly, yet its flaky when it comes back up that
doesn't sound too much like a database problem to me...
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Le 05/06/2007 12:06, Tom Lane a ���crit:
I had already noticed that problem on this platform with 8.0 or 8.1, but
didn't report it at the time.You should have mentioned it earlier :-( ... whatever the problem is
is surely now long-forgotten. Although I see we have a report of 8.1
working with IRIX 6.5, so apparently it works for some people. Are you
sure your compiler is up-to-date?
Yep, MIPSPro 7.4.4m is the latest available and probably the last one :(
If necessary, I can provide access to an irix machine if some pgsql
developpers need it to test/maintain the irix port.I'm willing to take a look, but first please see if you can eliminate
the compiler-bug theory. Check to see if there's a newer compiler
version available; check whether building with CFLAGS=-O0 makes the
problem go away.
I'll try without optimizations.
Regards,
--
Herve Boulouis
Le 05/06/2007 17:33, Simon Riggs a ���crit:
On Tue, 2007-06-05 at 14:42 +0000, Herve Boulouis wrote:
Well, I'd question what's going on with the hardware/filesystem or the
settings you've chosen for fsync and wal_fsync.If you shut it down cleanly, yet its flaky when it comes back up that
doesn't sound too much like a database problem to me...
I should have precised that I have "fsync = on" in my postgresql.conf.
Pretty much everything else is default value.
Regards,
--
Herve Boulouis
Herve Boulouis wrote:
Le 05/06/2007 17:33, Simon Riggs a �crit:
On Tue, 2007-06-05 at 14:42 +0000, Herve Boulouis wrote:
Well, I'd question what's going on with the hardware/filesystem or the
settings you've chosen for fsync and wal_fsync.If you shut it down cleanly, yet its flaky when it comes back up that
doesn't sound too much like a database problem to me...I should have precised that I have "fsync = on" in my postgresql.conf.
Pretty much everything else is default value.
Do you mean "fsync=off"? "on" is the default.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Le 06/06/2007 08:24, Heikki Linnakangas a ���crit:
Herve Boulouis wrote:
Le 05/06/2007 17:33, Simon Riggs a ?crit:
On Tue, 2007-06-05 at 14:42 +0000, Herve Boulouis wrote:
Well, I'd question what's going on with the hardware/filesystem or the
settings you've chosen for fsync and wal_fsync.If you shut it down cleanly, yet its flaky when it comes back up that
doesn't sound too much like a database problem to me...I should have precised that I have "fsync = on" in my postgresql.conf.
Pretty much everything else is default value.Do you mean "fsync=off"? "on" is the default.
No, I mean "fsync=on", everything else is commented out so I assume default
values.
--
Herve Boulouis
Le 05/06/2007 12:06, Tom Lane a ���crit:
I'm willing to take a look, but first please see if you can eliminate
the compiler-bug theory. Check to see if there's a newer compiler
version available; check whether building with CFLAGS=-O0 makes the
problem go away.
quick question : do you think that the --enable-integer-datetimes configure
flag could have an impact on this problem ?
--
Herve Boulouis
Herve Boulouis <amon@sockar.homeip.net> writes:
quick question : do you think that the --enable-integer-datetimes configure
flag could have an impact on this problem ?
I doubt it.
regards, tom lane
Le 05/06/2007 12:06, Tom Lane a ���crit:
I'm willing to take a look, but first please see if you can eliminate
the compiler-bug theory. Check to see if there's a newer compiler
version available; check whether building with CFLAGS=-O0 makes the
problem go away.
I tried -O0 with same results but I've found something more interesting :
It seems that my compiler version is not 7.4.4m but 7.4.1m, I got lost
in the sgi compiler upgrades. As the folks at nekochan.net provide a binary
package of postgresql for irix that works correctly, I tried compiling
with their build flags but the problem is still there. I asked the
package maintainer which compiler version they use and it's 7.4.4m. So
our problem could very well be a compiler bug.
The mipspro compiler is a proprietary product and I cannot get the
necessary updates so for now I am unable to verify this theory. Anyway,
maybe you should a add a note in the pg irix faq to use the 7.4.4m version
of mipspro when building the sources ?
--
Herve Boulouis
Herve Boulouis <amon@sockar.homeip.net> writes:
The mipspro compiler is a proprietary product and I cannot get the
necessary updates so for now I am unable to verify this theory. Anyway,
maybe you should a add a note in the pg irix faq to use the 7.4.4m version
of mipspro when building the sources ?
Maybe, but I'd like confirmation of the theory first. Please let me
know how it goes when you get hold of 7.4.4m ...
regards, tom lane
Le 06/06/2007 18:04, Tom Lane a ���crit:
Hi,
Herve Boulouis <amon@sockar.homeip.net> writes:
The mipspro compiler is a proprietary product and I cannot get the
necessary updates so for now I am unable to verify this theory. Anyway,
maybe you should a add a note in the pg irix faq to use the 7.4.4m version
of mipspro when building the sources ?Maybe, but I'd like confirmation of the theory first. Please let me
know how it goes when you get hold of 7.4.4m ...
I finally got the 7.4.4m update and repeated the entire test (compile,
install, initdb, start/stop multiple times) and the problem has gone away.
My test server hasn't changed since I reported the bug, so I think you
can add a note in the pg documentation regarding the MIPSPro version to use.
(7.4.4m ok, 7.4.1m broken, dunno about the intermediate versions)
--
Herve Boulouis
Herve Boulouis <amon@sockar.homeip.net> writes:
I finally got the 7.4.4m update and repeated the entire test (compile,
install, initdb, start/stop multiple times) and the problem has gone away.
My test server hasn't changed since I reported the bug, so I think you
can add a note in the pg documentation regarding the MIPSPro version to use.
(7.4.4m ok, 7.4.1m broken, dunno about the intermediate versions)
Done, thanks for the update!
regards, tom lane