What is happening on buildfarm member dugong?
dugong has been failing contribcheck repeatably for the last day or so,
with a very interesting symptom: CREATE DATABASE is failing with
ERROR: could not fsync segment 0 of relation 1663/40960/41403: No such file or directory
ERROR: checkpoint request failed
HINT: Consult recent messages in the server log for details.
I'd think we broke something in fsync cancellation signalling, except
that AFAICS there were no CVS commits at all between the last working
build at 2007-09-09 23:05:01 UTC and the first failure at 2007-09-10
00:45:27 UTC (and even if I got the timezone conversion wrong, there are
no nearby commits in the backend/storage area). Has this machine had
any system-software updates around then? Can anyone suggest another
theory?
regards, tom lane
On Tue, 11 Sep 2007, Tom Lane wrote:
dugong has been failing contribcheck repeatably for the last day or so,
with a very interesting symptom: CREATE DATABASE is failing with
The reason for that is that I've been trying to switch from version 9.1 to
version 10.0 of the ICC compiler. A month ago I tried for the first time,
discovered a segfault due to a bug in ICC, and submitted it to Intel. Intel
has since fixed it in icc 10.0.026, and over the last several days I have
been trying to make the new version work with Postgres. At first it seemed
to work, so I upgraded the compiler for the buildfarm, but then the
buildfarm failed.
A few notes:
1) Without --enable-cassert everything works.
2) With --enable-cassert, the only thing that fails in the tests is
contrib-installcheck...
3) I also recently tried to compile PG with the -O0 flag, and that actually
worked.
4) Also, I have just tried to compile PG 8.2.4, and the same problem occurs.
So to me the most probable explanation is an ICC bug, but unfortunately,
since it is not a pure segfault, it is a bit hard for me to tackle...
So I can either switch back to 9.1 completely and forget about it, or we
can try to find, or at least localize, this bug (if it is indeed ICC's
fault). But to do that I need some advice/help on how best to proceed...
regards,
Sergey
*******************************************************************
Sergey E. Koposov
Max Planck Institute for Astronomy/Cambridge Institute for Astronomy/Sternberg Astronomical Institute
Tel: +49-6221-528-349
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru
"Sergey E. Koposov" <math@sai.msu.ru> writes:
On Tue, 11 Sep 2007, Tom Lane wrote:
dugong has been failing contribcheck repeatably for the last day or so,
with a very interesting symptom: CREATE DATABASE is failing with
The reason for that is that I've been trying to switch from version 9.1 to
version 10.0 of the ICC compiler.
Hah, interesting.
A few notes:
1) Without --enable-cassert everything works.
2) With --enable-cassert, the only thing that fails in the tests is
contrib-installcheck...
3) I also recently tried to compile PG with the -O0 flag, and that actually
worked.
4) Also, I have just tried to compile PG 8.2.4, and the same problem occurs.
So I can either switch back to 9.1 completely and forget about it, or we
can try to find, or at least localize, this bug (if it is indeed ICC's
fault). But to do that I need some advice/help on how best to proceed...
Well, the first thing I'd suggest is trying to localize which Assert
makes it fail. From the bug's behavior I think it is highly probable
that the problem is in fsync signalling, which puts it either in
bgwriter.c or md.c. Try recompiling those modules separately without
cassert (leaving all else enabled) and see if the problem comes and
goes; if so, comment out one Assert at a time till you find which one.
Actually ... another possibility is that it's not directly an Assert,
but CLOBBER_FREED_MEMORY that exposes the bug. (This would suggest
that the compiler is trying to re-order memory accesses around a pfree.)
So before you get into the one-assert-at-a-time test, try with
--enable-cassert but modify pg_config_manual.h to not define
CLOBBER_FREED_MEMORY.
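(For reference, a minimal sketch of that edit in src/include/pg_config_manual.h,
assuming the define is guarded by USE_ASSERT_CHECKING as in branches of that
era; the surrounding comments are paraphrased, not copied:)

/* src/include/pg_config_manual.h -- paraphrased sketch, not the exact file */

/*
 * CLOBBER_FREED_MEMORY makes pfree() overwrite released memory with a
 * recognizable byte pattern, so references to already-freed data fail fast.
 * It is normally defined automatically in --enable-cassert builds; leave it
 * commented out to test whether it is what exposes the misbehaviour.
 */
#ifdef USE_ASSERT_CHECKING
/* #define CLOBBER_FREED_MEMORY */
#endif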
This could be a compiler bug, or it could be our fault --- might need
a "volatile" on some pointer or other, for example, to prevent the
compiler from making an otherwise legitimate assumption. So it seems
worth chasing it down.
BTW, does ICC have any switch corresponding to gcc's -fno-strict-aliasing?
I see that configure tries to feed that switch to it, but it might
want some other spelling.
regards, tom lane
This could be a compiler bug, or it could be our fault --- might need
a "volatile" on some pointer or other, for example, to prevent the
compiler from making an otherwise legitimate assumption. So it seems
worth chasing it down.
Tom, thank you for the directions; I'll try what you recommended.
BTW, does ICC have any switch corresponding to gcc's -fno-strict-aliasing?
I see that configure tries to feed that switch to it, but it might
want some other spelling.
Apparently -fno-strict-aliasing is not described in any of the ICC manuals,
but ICC accepts the flag and produces the same code as with the
'-fno-alias' flag (which is described in the ICC manuals).
regards,
Sergey
"Sergey E. Koposov" <math@sai.msu.ru> writes:
BTW, does ICC have any switch corresponding to gcc's -fno-strict-aliasing?
I see that configure tries to feed that switch to it, but it might
want some other spelling.
Apparently -fno-strict-aliasing is not described in any of the ICC manuals,
but ICC accepts the flag and produces the same code as with the
'-fno-alias' flag (which is described in the ICC manuals).
Well, since configure.in has a separate code path for ICC anyway,
it seems like we might as well provide it the official spelling.
Any objections to a patch like this?
if test "$GCC" = yes -a "$ICC" = no; then
CFLAGS="$CFLAGS -Wall -Wmissing-prototypes -Wpointer-arith -Winline"
# These work in some but not all gcc versions
PGAC_PROG_CC_CFLAGS_OPT([-Wdeclaration-after-statement])
PGAC_PROG_CC_CFLAGS_OPT([-Wendif-labels])
# Disable strict-aliasing rules; needed for gcc 3.3+
PGAC_PROG_CC_CFLAGS_OPT([-fno-strict-aliasing])
elif test "$ICC" = yes; then
# Intel's compiler has a bug/misoptimization in checking for
# division by NAN (NaN == 0), -mp1 fixes it, so add it to the CFLAGS.
PGAC_PROG_CC_CFLAGS_OPT([-mp1])
- # Not clear if this is needed, but seems like a good idea
- PGAC_PROG_CC_CFLAGS_OPT([-fno-strict-aliasing])
+ # ICC prefers to spell the no-strict-aliasing switch like this
+ PGAC_PROG_CC_CFLAGS_OPT([-fno-alias])
elif test x"${CC}" = x"xlc"; then
# AIX xlc has to have strict aliasing turned off too
PGAC_PROG_CC_CFLAGS_OPT([-qnoansialias])
fi
regards, tom lane
Well, the first thing I'd suggest is trying to localize which Assert
makes it fail. From the bug's behavior I think it is highly probable
that the problem is in fsync signalling, which puts it either in
bgwriter.c or md.c. Try recompiling those modules separately without
cassert (leaving all else enabled) and see if the problem comes and
goes; if so, comment out one Assert at a time till you find which one.
Actually ... another possibility is that it's not directly an Assert,
but CLOBBER_FREED_MEMORY that exposes the bug. (This would suggest
that the compiler is trying to re-order memory accesses around a pfree.)
So before you get into the one-assert-at-a-time test, try with
--enable-cassert but modify pg_config_manual.h to not define
CLOBBER_FREED_MEMORY.
It seems that neither undefining CLOBBER_FREED_MEMORY nor disabling
assertions for md.c and bgwriter.c helps... contrib-installcheck still
fails.
I disabled assertions for md.c and bgwriter.c by inserting
#undef USE_ASSERT_CHECKING
at the top of md.c and bgwriter.c, right after the inclusion of postgres.h
but before the other includes. (I believe that is the right way to do it.)
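(A minimal sketch of that per-file edit, paraphrased rather than copied from
md.c. One hedged caveat: postgres.h pulls in c.h, which fixes the expansion
of Assert() according to whether USE_ASSERT_CHECKING is defined at that
moment, so an #undef placed after postgres.h most likely does not strip the
Assert calls from the file; redefining Assert itself is the more direct way
to neutralize assertions in a single file:)

/* Top of src/backend/storage/smgr/md.c -- paraphrased, not the real file */
#include "postgres.h"

#undef USE_ASSERT_CHECKING		/* what was tried: probably too late, since
								 * c.h has already defined Assert() by now */

/*
 * A more direct way to silence assertions in just this file is to replace
 * the Assert macro itself after postgres.h has been included:
 */
#undef Assert
#define Assert(condition)	((void) 0)

/* ... md.c's own includes and code would follow here ... */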
My configure flags:
./configure --enable-cassert --enable-depend --enable-debug --enable-nls
--enable-integer-datetimes --with-libxml LDFLAGS='-lirc -limf'
--enable-depend --prefix=/home/math/cvs/install/ CC=icc
Regards,
Sergey
Actually, in the log file I also see some messages about hash_seq_search:
COMMAND: CREATE DATABASE "contrib_regression" TEMPLATE=template0
NOTICE: database "contrib_regression" does not exist, skipping
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
ERROR: could not fsync segment 0 of relation 1663/16384/2617: No such file or directory
ERROR: checkpoint request failed
I also tried turning off assertions for dynahash.c, but it didn't help...
regards,
Sergey
"Sergey E. Koposov" <math@sai.msu.ru> writes:
Actually, in the log file I also see some messages about hash_seq_search:
COMMAND: CREATE DATABASE "contrib_regression" TEMPLATE=template0
NOTICE: database "contrib_regression" does not exist, skipping
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
ERROR: could not fsync segment 0 of relation 1663/16384/2617: No such file or directory
ERROR: checkpoint request failed
That could be a knock-on effect, I think --- the bgwriter is lacking an
AtEOXact_HashTables call in its error recovery (something I will go fix),
and so after enough fsync errors we'd start getting these.
Anyway it seems we need to cast the net a bit wider for where the
troublesome Assert is. I'd suggest rebuilding the whole system with
--enable-cassert, then comment out the USE_ASSERT_CHECKING #define
in pg_config.h, and "make clean/make" in one backend subdirectory
at a time till you see where it stops failing. Then repeat at the
file level. Divide and conquer...
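(For reference, the relevant line in the generated src/include/pg_config.h
should look roughly like this in a --enable-cassert build; commenting it out
and then doing "make clean/make" one subdirectory at a time is the bisection
being suggested:)

/* src/include/pg_config.h, as generated by configure (approximate wording) */

/* Define to 1 to build with assertion checks. (--enable-cassert) */
/* #define USE_ASSERT_CHECKING 1 */		/* commented out while bisecting */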
regards, tom lane
"Sergey E. Koposov" <math@sai.msu.ru> writes:
Actually, in the log file I also see some messages about has_seq_search:
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
BTW, I just made a commit to include the hash table name in this
message. Could you update src/backend/utils/hash/dynahash.c and retry
the test? I suspect it'll say the bgwriter's pending-ops table, but
we should verify that.
regards, tom lane
It seems to me that the last run
(http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=dugong&dt=2007-09-11%2016:05:01)
points to a problem with the hash implementation.
*** ./expected/tsdicts.out Tue Sep 11 20:05:23 2007
--- ./results/tsdicts.out Tue Sep 11 20:18:38 2007
***************
*** 301,306 ****
--- 301,307 ----
lword, lpart_hword, lhword
WITH synonym, thesaurus, english_stem;
SELECT to_tsvector('thesaurus_tst', 'one postgres one two one two three one');
+ NOTICE: thesaurus word-sample "the" is recognized as stop-word, assign any stop-word (rule 8)
At this point in the tsdicts test the thesaurus dictionary should already be
loaded and initialized, but this NOTICE indicates that the thesaurus was
initialized here instead.
ERROR: too many active hash_seq_search scans
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
It seems to me that the last run
(http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=dugong&dt=2007-09-11%2016:05:01)
points to a problem with the hash implementation.
dynahash.c is used all over the system, though. If it were broken by a
compiler issue, it's hard to credit that we'd be getting through all but
one or two regression tests ...
regards, tom lane
On Tue, 11 Sep 2007, Tom Lane wrote:
NOTICE: database "contrib_regression" does not exist, skipping
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
The hash_seq_search ERROR was partially a false alarm: I had an old
postgres daemon hanging around and writing to that log.
Although I do remember seeing the hash_seq_search message recently while
dealing with this bug, it does not show up in the course of the standard
regression tests.
regards,
Sergey
"Sergey E. Koposov" <math@sai.msu.ru> writes:
On Tue, 11 Sep 2007, Tom Lane wrote:
NOTICE: database "contrib_regression" does not exist, skipping
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
The hash_seq_search ERROR was partially a false alarm: I had an old
postgres daemon hanging around and writing to that log.
Although I do remember seeing the hash_seq_search message recently while
dealing with this bug, it does not show up in the course of the standard
regression tests.
Yeah, it's not there on your buildfarm reports, but that's not totally
surprising. I would expect it to start showing up after 100 failed
checkpoint attempts, which is how long it'd take the bgwriter's
hash_seq_search table to overflow ...
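(To put a toy example behind that arithmetic: the sketch below uses made-up
names and is not the real dynahash.c code, but it shows the failure mode;
active sequential scans are tracked in a small fixed-size array, a scan
normally deregisters itself when it finishes, and an error path that skips
the cleanup leaks one slot per failure until every later scan is refused.)

#include <stdio.h>

#define MAX_ACTIVE_SCANS 100	/* assumed limit, for illustration only */

static void *active_scans[MAX_ACTIVE_SCANS];
static int	num_active_scans = 0;

/* Register a sequential scan; fails once all tracking slots are used up. */
static int
register_scan(void *hashtable)
{
	if (num_active_scans >= MAX_ACTIVE_SCANS)
	{
		fprintf(stderr, "ERROR: too many active hash_seq_search scans\n");
		return -1;
	}
	active_scans[num_active_scans++] = hashtable;
	return 0;
}

int
main(void)
{
	int			dummy_table;
	int			attempt;

	/* Each failed checkpoint attempt leaks one registered scan slot. */
	for (attempt = 1; attempt <= 105; attempt++)
	{
		if (register_scan(&dummy_table) != 0)
		{
			printf("attempt %d refused; %d slots already leaked\n",
				   attempt, num_active_scans);
			break;
		}
		/* error path: the scan is never deregistered, so the slot stays used */
	}
	return 0;
}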
regards, tom lane
On Tue, 11 Sep 2007, Tom Lane wrote:
"Sergey E. Koposov" <math@sai.msu.ru> writes:
On Tue, 11 Sep 2007, Tom Lane wrote:
NOTICE: database "contrib_regression" does not exist, skipping
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
ERROR: too many active hash_seq_search scans
The hash_seq_search ERROR was partially a false alarm: I had an old
postgres daemon hanging around and writing to that log.
Although I do remember seeing the hash_seq_search message recently while
dealing with this bug, it does not show up in the course of the standard
regression tests.
Yeah, it's not there on your buildfarm reports, but that's not totally
surprising. I would expect it to start showing up after 100 failed
checkpoint attempts, which is how long it'd take the bgwriter's
hash_seq_search table to overflow ...
Yes, indeed. After several make installcheck's
I get
ERROR: too many active hash_seq_search scans, cannot start one on "smgr relation table"
ERROR: too many active hash_seq_search scans, cannot start one on "smgr relation table"
Sergey
"Sergey E. Koposov" <math@sai.msu.ru> writes:
Yes, indeed. After several make installcheck's
I get
ERROR: too many active hash_seq_search scans, cannot start one on "smgr relation table"
ERROR: too many active hash_seq_search scans, cannot start one on "smgr relation table"
Hm, so that must be coming from smgrcloseall(), which is the only user
of hash_seq_search on SMgrRelationHash. I bet that's popping up once a
second and the bgwriter is getting nothing done, because it's failing
again at the bottom of error recovery :-(. It's a good thing you
happened to notice those messages, because this is a pretty bad bug.
Anyway, I've committed a fix for that, so we can get back to the main
question, which is why you're getting the fsync error in the first place.
regards, tom lane
Sergey E. Koposov wrote:
The hash_seq_search ERROR was partially a false alarm: I had an old
postgres daemon hanging around and writing to that log.
Although I do remember seeing the hash_seq_search message recently while
dealing with this bug, it does not show up in the course of the standard
regression tests.
Your buildfarm member must be seriously misconfigured if you can get the
logs from different postmasters comingled. Every run gets its own
logfile in its own inst directory.
cheers
andrew
On Tue, 11 Sep 2007, Andrew Dunstan wrote:
Your buildfarm member must be seriously misconfigured if you can get the logs
from different postmasters comingled. Every run gets its own logfile in its
own inst directory.
No, everything I'm doing now to chase that bug is happening in a place
entirely separate from the buildfarm. That is where the logs got mixed.
regards,
Sergey
On Tue, 11 Sep 2007, Tom Lane wrote:
Well, the first thing I'd suggest is trying to localize which Assert
makes it fail. From the bug's behavior I think it is highly probable
that the problem is in fsync signalling, which puts it either in
bgwriter.c or md.c. Try recompiling those modules separately without
cassert (leaving all else enabled) and see if the problem comes and
goes; if so, comment out one Assert at a time till you find which one.
It turned out that the offending Assert is
Assert(BgWriterShmem != NULL); at bgwriter.c:990
After commenting it out, everything works.
Also, I tried adding 'volatile' to the declaration of BgWriterShmem; after
that the problem disappears too.
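(A paraphrased sketch of the change being described; the declaration below is
not copied from bgwriter.c and the struct contents are placeholders. Note
that this version qualifies the pointed-to data, which is what would force
the kind of cast Tom mentions next:)

#include <stddef.h>

/* Paraphrased sketch; the field is a placeholder, not bgwriter.c's layout. */
typedef struct BgWriterShmemStruct
{
	int			num_backend_writes;		/* placeholder member */
} BgWriterShmemStruct;

/*
 * Before: a plain static pointer into shared memory,
 *     static BgWriterShmemStruct *BgWriterShmem = NULL;
 *
 * After (the experiment): qualifying the pointed-to data as volatile forbids
 * the compiler from caching, reordering, or eliding accesses made through
 * this pointer, which was apparently enough to make the misbehaviour go away.
 */
static volatile BgWriterShmemStruct *BgWriterShmem = NULL;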
I'm not sure this demonstrates that it's not an ICC bug, because obviously
the 'volatile' qualifier can change how the compiler behaves...
I tried adding the volatile keyword to BgWriterShmem in PG 8.2.4 as well,
and indeed it solved the problem in the 8.2.4 version too.
From what I see in bgwriter.c, the volatile keyword for BgWriterShmem
seems very reasonable to me, although I'm not sure that it's really
required there...
regards,
Sergey
PS: I'm sorry for the wrong information about the no-strict-aliasing flags
for ICC; I was obviously confused by the ICC docs.
"Sergey E. Koposov" <math@sai.msu.ru> writes:
It turned out that the offending Assert is
Assert(BgWriterShmem != NULL); at bgwriter.c:990
After commenting it out, everything works.
That's simply bizarre ...
Also, I tried adding 'volatile' to the declaration of BgWriterShmem; after
that the problem disappears too.
Hm. I don't see any very good reason in the code to add the "volatile",
and I see at least one place where we'd have to cast it away (the MemSet
at line 836). My inclination is just to remove the Assert at line 990.
It's not proving anything, since if indeed BgWriterShmem was NULL there,
we'd dump core on the dereferences just a couple lines below.
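(The redundancy argument can be shown with a small self-contained sketch; the
names below are made up and only paraphrase the shape of the real code:)

#include <assert.h>
#include <stdio.h>

/* Toy illustration with made-up names; not the actual bgwriter.c code. */
typedef struct
{
	int			num_requests;
	int			max_requests;
} ShmemStruct;

static ShmemStruct *shmem_ptr = NULL;	/* imagine this points into shared memory */

static int
forward_request(void)
{
	assert(shmem_ptr != NULL);	/* the kind of assertion in question */

	/*
	 * The pointer is dereferenced immediately afterwards anyway, so if it
	 * really were NULL the process would crash right here with or without
	 * the assertion.  That is why dropping the Assert loses essentially
	 * nothing in terms of error detection.
	 */
	if (shmem_ptr->num_requests >= shmem_ptr->max_requests)
		return 0;				/* queue full; caller must handle it itself */
	shmem_ptr->num_requests++;
	return 1;
}

int
main(void)
{
	ShmemStruct shmem = {0, 10};

	shmem_ptr = &shmem;
	printf("request forwarded: %d\n", forward_request());
	return 0;
}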
Do you want this patched any further back than HEAD? The buildfarm
status page doesn't show dugong doing any back branches ...
regards, tom lane
Teodor Sigaev <teodor@sigaev.ru> writes:
It seems to me that the last run
(http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=dugong&dt=2007-09-11%2016:05:01)
points to a problem with the hash implementation.
SELECT to_tsvector('thesaurus_tst', 'one postgres one two one two three one');
+ NOTICE: thesaurus word-sample "the" is recognized as stop-word, assign any stop-word (rule 8)
At this point in the tsdicts test the thesaurus dictionary should already be
loaded and initialized, but this NOTICE indicates that the thesaurus was
initialized here instead.
I just realized what that probably is actually from: there was a cache
invalidation event sometime between when thesaurus was initially loaded
and when this statement tried to use it, so the ts_cache entry had to
be reloaded. The parallel regression tests are quite capable of
provoking sinval queue overflows (and ensuing cache resets) at fairly
random places.
It is not good design to have any user-visible behavior that occurs during
a cache load, because you can't predict when those will happen. Perhaps
this NOTICE should not be emitted, or should be emitted from some other
place.
(In fact, this test ought to be failing right now on whichever buildfarm
machine is supposed to be testing CLOBBER_CACHE_ALWAYS. Which animal
was that again?)
regards, tom lane
Tom Lane wrote:
Teodor Sigaev <teodor@sigaev.ru> writes:
It seems to me that the last run
(http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=dugong&dt=2007-09-11%2016:05:01)
points to a problem with the hash implementation.
SELECT to_tsvector('thesaurus_tst', 'one postgres one two one two three one');
+ NOTICE: thesaurus word-sample "the" is recognized as stop-word, assign any stop-word (rule 8)
At this point in the tsdicts test the thesaurus dictionary should already be
loaded and initialized, but this NOTICE indicates that the thesaurus was
initialized here instead.
I just realized what that probably is actually from: there was a cache
invalidation event sometime between when thesaurus was initially loaded
and when this statement tried to use it, so the ts_cache entry had to
be reloaded. The parallel regression tests are quite capable of
provoking sinval queue overflows (and ensuing cache resets) at fairly
random places.
It is not good design to have any user-visible behavior that occurs during
a cache load, because you can't predict when those will happen. Perhaps
this NOTICE should not be emitted, or should be emitted from some other
place.
(In fact, this test ought to be failing right now on whichever buildfarm
machine is supposed to be testing CLOBBER_CACHE_ALWAYS. Which animal
was that again?)
fwiw - I managed to trigger that exact same regression failure during my
testing of the pltcl patch on quagga too ...
Stefan
"Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> writes:
Tom Lane wrote:
Teodor Sigaev <teodor@sigaev.ru> writes:
It seems to me that the last run
(http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=dugong&dt=2007-09-11%2016:05:01)
points to a problem with the hash implementation.
SELECT to_tsvector('thesaurus_tst', 'one postgres one two one two three one');
+ NOTICE: thesaurus word-sample "the" is recognized as stop-word, assign any stop-word (rule 8)
FWIW what does this message even mean? Is "assign any stop-word" just a
description of where this message is coming from? Is it imperative,
instructing the user to do something?
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com