Curious buildfarm failures

Started by Tom Laneabout 13 years ago36 messageshackers

tgl@sss.pgh.pa.us

about 13 years ago

Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
a few buildfarm failures along the lines of

-- Commit table drop
COMMIT PREPARED 'regress-two';
! PANIC: failed to re-find shared proclock object
! PANIC: failed to re-find shared proclock object
! connection to server was lost

Evidently I bollixed something, but what? I've been unable to reproduce
this locally so far. Anybody see what's wrong?

Another thing is that dugong has been reproducibly failing with

 drop cascades to table testschema.atable
  -- Should succeed
  DROP TABLESPACE testspace;
+ ERROR:  tablespace "testspace" is not empty

since the elog-doesn't-return patch (b853eb97) went in. Maybe this is
some local problem there but I'm suspicious that there's a connection.
But what?

Any insights out there?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andres Freund

andres@anarazel.de

about 13 years ago

In reply to: Tom Lane (#1)

Re: Curious buildfarm failures

On 2013-01-14 16:35:48 -0500, Tom Lane wrote:

Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
a few buildfarm failures along the lines of

-- Commit table drop
COMMIT PREPARED 'regress-two';
! PANIC: failed to re-find shared proclock object
! PANIC: failed to re-find shared proclock object
! connection to server was lost

Evidently I bollixed something, but what? I've been unable to reproduce
this locally so far. Anybody see what's wrong?

Another thing is that dugong has been reproducibly failing with
drop cascades to table testschema.atable
-- Should succeed
DROP TABLESPACE testspace;
+ ERROR:  tablespace "testspace" is not empty
since the elog-doesn't-return patch (b853eb97) went in. Maybe this is
some local problem there but I'm suspicious that there's a connection.
But what?

Any insights out there?

It also has:

LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: autovacuum launcher shutting down
LOG: shutting down
FATAL: could not open file "base/16384/28182": No such file or directory
CONTEXT: writing block 6 of relation base/16384/28182
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1743)
LOG: checkpointer process (PID 30366) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
LOG: abnormal database system shutdown

================== stack trace: pgsql.9958/src/test/regress/tmp_check/data/core ==================
Using host libthread_db library "/lib/tls/libthread_db.so.1".

warning: Can't read pathname for load map: Input/output error.
Core was generated by `postgres: checkpointer process '.
Program terminated with signal 6, Aborted.

#0 0xa000000000010620 in __kernel_syscall_via_break ()
#0 0xa000000000010620 in __kernel_syscall_via_break ()
#1 0x2000000000953bb0 in raise () from /lib/tls/libc.so.6.1
#2 0x2000000000956df0 in abort () from /lib/tls/libc.so.6.1
#3 0x4000000000b4b510 in ExceptionalCondition (
conditionName=0x4000000000d76390 "!(PrivateRefCount[i] == 0)",
errorType=0x4000000000d06500 "FailedAssertion",
fileName=0x4000000000d76260 "bufmgr.c", lineNumber=1743) at assert.c:54
#4 0x40000000007a7d20 in AtProcExit_Buffers (code=1, arg=59) at bufmgr.c:1743
#5 0x40000000007c4e50 in shmem_exit (code=1) at ipc.c:221
#6 0x40000000007c4fa0 in proc_exit_prepare (code=1) at ipc.c:181
#7 0x40000000007c4ab0 in proc_exit (code=1) at ipc.c:96
#8 0x4000000000b5d390 in errfinish (dummy=0) at elog.c:518
#9 0x4000000000823380 in _mdfd_getseg (reln=0x6000000000155420,
forknum=1397792, blkno=6, skipFsync=0 '\0', behavior=EXTENSION_FAIL)
at md.c:577
#10 0x400000000081e5c0 in mdwrite (reln=0x6000000000155420,
forknum=MAIN_FORKNUM, blocknum=6, buffer=0x2000000001432ea0 "",
skipFsync=0 '\0') at md.c:735
#11 0x4000000000824690 in smgrwrite (reln=0x6000000000155420,
forknum=MAIN_FORKNUM, blocknum=6, buffer=0x2000000001432ea0 "",
skipFsync=0 '\0') at smgr.c:534
#12 0x400000000079e510 in FlushBuffer (buf=0x1, reln=0x6000000000155420)
at bufmgr.c:1941
#13 0x40000000007a10b0 in SyncOneBuffer (buf_id=0, skip_recently_used=0 '\0')
at bufmgr.c:1677
#14 0x40000000007a0c00 in CheckPointBuffers (flags=5) at bufmgr.c:1284
#15 0x40000000001fcbf0 in CheckPointGuts (checkPointRedo=80827000, flags=5)
at xlog.c:7391
#16 0x40000000001fb2a0 in CreateCheckPoint (flags=5) at xlog.c:7240
#17 0x40000000001f6820 in ShutdownXLOG (code=14699520,
arg=4611686018440093920) at xlog.c:6823
#18 0x400000000072d780 in _setjmp_lpad_CheckpointerMain_0$0$18 ()
at checkpointer.c:413
#19 0x4000000000235810 in AuxiliaryProcessMain (argc=496536,
argv=0x60000fffff80e520) at bootstrap.c:433
#20 0x40000000007172b0 in StartChildProcess (type=508288) at postmaster.c:4956
#21 0x4000000000713f50 in reaper (postgres_signal_arg=30365)
at postmaster.c:2568
#22 <signal handler called>
#23 0xa000000000010620 in __kernel_syscall_via_break ()
#24 0x2000000000953f70 in sigprocmask () from /lib/tls/libc.so.6.1
#25 0x4000000000720480 in ServerLoop () at postmaster.c:1521
#26 0x400000000071d9d0 in PostmasterMain (argc=6, argv=0x60000000000d85e0)
at postmaster.c:1244
#27 0x4000000000577a30 in main (argc=6, argv=0x60000000000d8010) at main.c:197

in the log. So it seems like it also could be related to locking
changes although I don't immediately see why.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andres Freund

andres@anarazel.de

about 13 years ago

In reply to: Andres Freund (#2)

Re: Curious buildfarm failures

On 2013-01-14 22:50:16 +0100, Andres Freund wrote:

On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
a few buildfarm failures along the lines of

-- Commit table drop
COMMIT PREPARED 'regress-two';
! PANIC: failed to re-find shared proclock object
! PANIC: failed to re-find shared proclock object
! connection to server was lost

Evidently I bollixed something, but what? I've been unable to reproduce
this locally so far. Anybody see what's wrong?

Another thing is that dugong has been reproducibly failing with
drop cascades to table testschema.atable
-- Should succeed
DROP TABLESPACE testspace;
+ ERROR:  tablespace "testspace" is not empty
since the elog-doesn't-return patch (b853eb97) went in. Maybe this is
some local problem there but I'm suspicious that there's a connection.
But what?

Any insights out there?
It also has:

FATAL: could not open file "base/16384/28182": No such file or directory
CONTEXT: writing block 6 of relation base/16384/28182
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1743)

#3 0x4000000000b4b510 in ExceptionalCondition (
conditionName=0x4000000000d76390 "!(PrivateRefCount[i] == 0)",
errorType=0x4000000000d06500 "FailedAssertion",
fileName=0x4000000000d76260 "bufmgr.c", lineNumber=1743) at assert.c:54
#4 0x40000000007a7d20 in AtProcExit_Buffers (code=1, arg=59) at bufmgr.c:1743
#5 0x40000000007c4e50 in shmem_exit (code=1) at ipc.c:221

in the log. So it seems like it also could be related to locking
changes although I don't immediately see why.

No such "luck" though, the animal only applied the elog commits:
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dugong&dt=2013-01-14%2000%3A00%3A02&stg=SCM-checkout

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 13 years ago

In reply to: Tom Lane (#1)

Re: Curious buildfarm failures

On 14.01.2013 23:35, Tom Lane wrote:

Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
a few buildfarm failures along the lines of

-- Commit table drop
COMMIT PREPARED 'regress-two';
! PANIC: failed to re-find shared proclock object
! PANIC: failed to re-find shared proclock object
! connection to server was lost

Evidently I bollixed something, but what? I've been unable to reproduce
this locally so far. Anybody see what's wrong?

I was able to reproduce this by setting max_locks_per_transaction and
max_connections to the minimum. My assumption is that there's something
wrong in the way hash_update_hash_key() handles collisions.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 13 years ago

In reply to: Heikki Linnakangas (#4)

Re: Curious buildfarm failures

On 15.01.2013 00:14, Heikki Linnakangas wrote:

On 14.01.2013 23:35, Tom Lane wrote:

Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
a few buildfarm failures along the lines of

-- Commit table drop
COMMIT PREPARED 'regress-two';
! PANIC: failed to re-find shared proclock object
! PANIC: failed to re-find shared proclock object
! connection to server was lost

Evidently I bollixed something, but what? I've been unable to reproduce
this locally so far. Anybody see what's wrong?

I was able to reproduce this by setting max_locks_per_transaction and
max_connections to the minimum. My assumption is that there's something
wrong in the way hash_update_hash_key() handles collisions.

The problem seems to be when the the old and the key hash to the same
bucket. In that case, hash_update_hash_key() tries to link the entry to
itself. The attached patch fixes it for me.

- Heikki

Tom Lane

tgl@sss.pgh.pa.us

about 13 years ago

In reply to: Heikki Linnakangas (#5)

Re: Curious buildfarm failures

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

The problem seems to be when the the old and the key hash to the same
bucket. In that case, hash_update_hash_key() tries to link the entry to
itself. The attached patch fixes it for me.

Doh! I had a feeling that that needed a special case, but didn't think
hard enough. Thanks.

I think the patch could do with more than no comment, but will fix
that and apply.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andres Freund

andres@anarazel.de

about 13 years ago

In reply to: Andres Freund (#3)

Re: Curious buildfarm failures

On 2013-01-14 22:56:47 +0100, Andres Freund wrote:

On 2013-01-14 22:50:16 +0100, Andres Freund wrote:
On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
a few buildfarm failures along the lines of

-- Commit table drop
COMMIT PREPARED 'regress-two';
! PANIC: failed to re-find shared proclock object
! PANIC: failed to re-find shared proclock object
! connection to server was lost

Evidently I bollixed something, but what? I've been unable to reproduce
this locally so far. Anybody see what's wrong?

Another thing is that dugong has been reproducibly failing with
drop cascades to table testschema.atable
-- Should succeed
DROP TABLESPACE testspace;
+ ERROR:  tablespace "testspace" is not empty
since the elog-doesn't-return patch (b853eb97) went in. Maybe this is
some local problem there but I'm suspicious that there's a connection.
But what?

Any insights out there?
It also has:

FATAL: could not open file "base/16384/28182": No such file or directory
CONTEXT: writing block 6 of relation base/16384/28182
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1743)
#3 0x4000000000b4b510 in ExceptionalCondition (
conditionName=0x4000000000d76390 "!(PrivateRefCount[i] == 0)",
errorType=0x4000000000d06500 "FailedAssertion",
fileName=0x4000000000d76260 "bufmgr.c", lineNumber=1743) at assert.c:54
#4 0x40000000007a7d20 in AtProcExit_Buffers (code=1, arg=59) at bufmgr.c:1743
#5 0x40000000007c4e50 in shmem_exit (code=1) at ipc.c:221

in the log. So it seems like it also could be related to locking
changes although I don't immediately see why.

No such "luck" though, the animal only applied the elog commits:
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dugong&dt=2013-01-14%2000%3A00%3A02&stg=SCM-checkout

Do you have idea whats going on? I don't really have any clue other than
guessing it might be an compiler bug.

Could the buildfarm owner perhaps tell us which files are left in the
tablespace 'testspace'?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

about 13 years ago

In reply to: Andres Freund (#7)

Re: Curious buildfarm failures

Andres Freund <andres@2ndquadrant.com> writes:

On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
Another thing is that dugong has been reproducibly failing with
drop cascades to table testschema.atable
-- Should succeed
DROP TABLESPACE testspace;
+ ERROR:  tablespace "testspace" is not empty
since the elog-doesn't-return patch (b853eb97) went in. Maybe this is
some local problem there but I'm suspicious that there's a connection.
But what?

Do you have idea whats going on? I don't really have any clue other than
guessing it might be an compiler bug.

I'm suspicious of that too, but it's hard to see why it would manifest
like this while everything else appears to be fine. A codegen bug
triggered by elog ought to show up in a lot of places, one would think.

Could the buildfarm owner perhaps tell us which files are left in the
tablespace 'testspace'?

I'm betting the answer is "none", and that what's happening is some sort
of timing problem (ie, the DROP TABLESPACE somehow isn't waiting for the
checkpointer process to clean out all the just-dropped files). The
PANIC at shutdown looks like it might be some sort of doppelganger of
the same bug, ie dropped table cleaned out too early.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andrew Dunstan

andrew@dunslane.net

about 13 years ago

In reply to: Andres Freund (#7)

Re: Curious buildfarm failures

On 01/15/2013 11:04 AM, Andres Freund wrote:

Could the buildfarm owner perhaps tell us which files are left in the
tablespace 'testspace'?

They will not be able to easily - the workspace is normally cleared out
at the end of each run.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Andres Freund

andres@anarazel.de

about 13 years ago

In reply to: Tom Lane (#8)

Re: Curious buildfarm failures

On 2013-01-15 11:19:28 -0500, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:
On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
Another thing is that dugong has been reproducibly failing with
drop cascades to table testschema.atable
-- Should succeed
DROP TABLESPACE testspace;
+ ERROR:  tablespace "testspace" is not empty
since the elog-doesn't-return patch (b853eb97) went in. Maybe this is
some local problem there but I'm suspicious that there's a connection.
But what?
Do you have idea whats going on? I don't really have any clue other than
guessing it might be an compiler bug.

I'm suspicious of that too, but it's hard to see why it would manifest
like this while everything else appears to be fine. A codegen bug
triggered by elog ought to show up in a lot of places, one would think.

The make output showed that for some files optimization were disabled by
the compiler because they were to complex. Its possible that it is
related to that :(.

Interestingly the compiler couldn't deduce that
e.g. DateTimeParseError() didn't return with the old ereport definition
but it can with the new one which seems strange.

Could the buildfarm owner perhaps tell us which files are left in the
tablespace 'testspace'?

I'm betting the answer is "none", and that what's happening is some sort
of timing problem (ie, the DROP TABLESPACE somehow isn't waiting for the
checkpointer process to clean out all the just-dropped files). The
PANIC at shutdown looks like it might be some sort of doppelganger of
the same bug, ie dropped table cleaned out too early.

It seems strange tho that it started failing exactly with that commit
and all runs failed exactly the same way since.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Tom Lane

tgl@sss.pgh.pa.us

about 13 years ago

In reply to: Andres Freund (#10)

Re: Curious buildfarm failures

Andres Freund <andres@2ndquadrant.com> writes:

Interestingly the compiler couldn't deduce that
e.g. DateTimeParseError() didn't return with the old ereport definition
but it can with the new one which seems strange.

Oooh, I hadn't noticed that. I guess that must indicate that this
version of icc can deduce unreachability from

if (true)
abort();

but not from

(true) ? abort() : (void) 0;

which is a bit odd but not incredible. (I had actually wondered while
working on the patch if this wording might be better for some compilers;
seems that's the case.)

What that means is that this compiler was not previously aware that
either ereport(ERROR) or elog(ERROR) doesn't return, but it now knows
that for both. So that greatly expands the scope of places where
behavior might have changed. Doesn't simplify our task :-(

I'm betting the answer is "none", and that what's happening is some sort
of timing problem (ie, the DROP TABLESPACE somehow isn't waiting for the
checkpointer process to clean out all the just-dropped files).

It seems strange tho that it started failing exactly with that commit
and all runs failed exactly the same way since.

Well, it could be quite reproducible, if for example what's happening is
that the DROP is failing to wait for the checkpointer at all.

Is there a way to enable log_checkpoints during a buildfarm run?
It'd be good to get timestamps added to the postmaster log entries, too.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12

Andrew Dunstan

andrew@dunslane.net

about 13 years ago

In reply to: Tom Lane (#11)

Re: Curious buildfarm failures

On 01/15/2013 11:57 AM, Tom Lane wrote:

Well, it could be quite reproducible, if for example what's happening is
that the DROP is failing to wait for the checkpointer at all.

Is there a way to enable log_checkpoints during a buildfarm run?
It'd be good to get timestamps added to the postmaster log entries, too.

Yes, it's very easy. In the config file, do something like:

extra_config =>
{
DEFAULT => [
q(log_line_prefix = '%t [%c:%l] '),
"log_checkpoints = 'true',
],
},

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Andrew Dunstan

andrew@dunslane.net

about 13 years ago

In reply to: Andrew Dunstan (#12)

Re: Curious buildfarm failures

On 01/15/2013 12:07 PM, Andrew Dunstan wrote:

On 01/15/2013 11:57 AM, Tom Lane wrote:

Well, it could be quite reproducible, if for example what's happening is
that the DROP is failing to wait for the checkpointer at all.

Is there a way to enable log_checkpoints during a buildfarm run?
It'd be good to get timestamps added to the postmaster log entries, too.

Yes, it's very easy. In the config file, do something like:

I had a missing quote. should be:

extra_config =>
{
DEFAULT => [
q(log_line_prefix = '%t [%c:%l] '),
"log_checkpoints = 'true'",
],
},

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Sergey Koposov

koposov@ast.cam.ac.uk

about 13 years ago

In reply to: Andrew Dunstan (#13)

Re: Curious buildfarm failures (fwd)

Hi,

Date: Tue, 15 Jan 2013 11:57:07 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Andres Freund <andres@2ndquadrant.com>
Cc: math@sai.msu.ru, pgsql-hackers@postgreSQL.org,
Andrew Dunstan <andrew@dunslane.net>
Subject: Re: Curious buildfarm failures

Well, it could be quite reproducible, if for example what's happening is
that the DROP is failing to wait for the checkpointer at all.

Is there a way to enable log_checkpoints during a buildfarm run?
It'd be good to get timestamps added to the postmaster log entries, too.

Here is the log output from the failing pg_regress after enabling checkpoints and timestamps:

2013-01-15 21:20:19 MSK [50f58fd3.589e:1] LOG: database system was shut down at 2013-01-15 21:20:19 MS
K
2013-01-15 21:20:19 MSK [50f58fd3.58a2:1] LOG: autovacuum launcher started
2013-01-15 21:20:19 MSK [50f58fd3.5899:1] LOG: database system is ready to accept connections
2013-01-15 21:20:20 MSK [50f58fd3.589f:1] LOG: checkpoint starting: immediate force wait
2013-01-15 21:20:21 MSK [50f58fd3.589f:2] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 transact
ion log file(s) added, 0 removed, 0 recycled; write=0.604 s, sync=0.000 s, total=0.605 s; sync files=0,
longest=0.000 s, average=0.000 s
2013-01-15 21:20:21 MSK [50f58fd3.589f:3] LOG: checkpoint starting: immediate force wait
2013-01-15 21:20:21 MSK [50f58fd3.589f:4] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transact
ion log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.000 s; sync files=0,
longest=0.000 s, average=0.000 s
2013-01-15 21:20:21 MSK [50f58fd5.58ac:1] ERROR: unrecognized parameter "some_nonexistent_parameter"
2013-01-15 21:20:21 MSK [50f58fd5.58ac:2] STATEMENT: ALTER TABLESPACE testspace SET (some_nonexistent_
parameter = true);
2013-01-15 21:20:21 MSK [50f58fd5.58ac:3] ERROR: RESET must not include values for parameters
2013-01-15 21:20:21 MSK [50f58fd5.58ac:4] STATEMENT: ALTER TABLESPACE testspace RESET (random_page_cos
t = 2.0);
2013-01-15 21:20:21 MSK [50f58fd5.58ac:5] ERROR: duplicate key value violates unique constraint "anind
ex"
2013-01-15 21:20:21 MSK [50f58fd5.58ac:6] DETAIL: Key (column1)=(1) already exists.
2013-01-15 21:20:21 MSK [50f58fd5.58ac:7] STATEMENT: INSERT INTO testschema.atable VALUES(1);
2013-01-15 21:20:21 MSK [50f58fd5.58ac:8] ERROR: directory "/no/such/location" does not exist
2013-01-15 21:20:21 MSK [50f58fd5.58ac:9] STATEMENT: CREATE TABLESPACE badspace LOCATION '/no/such/loc
ation';
2013-01-15 21:20:21 MSK [50f58fd5.58ac:10] ERROR: tablespace "nosuchspace" does not exist
2013-01-15 21:20:21 MSK [50f58fd5.58ac:11] STATEMENT: CREATE TABLE bar (i int) TABLESPACE nosuchspace;
2013-01-15 21:20:21 MSK [50f58fd3.589f:5] LOG: checkpoint starting: immediate force wait
2013-01-15 21:20:21 MSK [50f58fd3.589f:6] LOG: checkpoint complete: wrote 37 buffers (0.2%); 0 transac
tion log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0
, longest=0.000 s, average=0.000 s
2013-01-15 21:20:21 MSK [50f58fd5.58ac:12] ERROR: tablespace "testspace" is not empty
2013-01-15 21:20:21 MSK [50f58fd5.58ac:13] STATEMENT: DROP TABLESPACE testspace;
2013-01-15 21:20:21 MSK [50f58fd3.589f:7] LOG: checkpoint starting: immediate force wait
2013-01-15 21:20:21 MSK [50f58fd3.589f:8] LOG: checkpoint complete: wrote 9 buffers (0.1%); 0 transact
ion log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0,
longest=0.000 s, average=0.000 s
2013-01-15 21:20:21 MSK [50f58fd5.58ac:14] ERROR: tablespace "testspace" is not empty
2013-01-15 21:20:21 MSK [50f58fd5.58ac:15] STATEMENT: DROP TABLESPACE testspace;

And I do see the tblspc file left after the finish of "make check":
tmp_check/data/pg_tblspc/16385/PG_9.3_201212081/16384/16387

Cheers,
S

PS I wouldn't be surprised that it is a compiler bug though. But I did see the
failure with newer icc as well.

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Import Notes

Reply to msg id not found: alpine.LRH.2.00.1301152117460.13066@lnfm1.sai.msu.ruReference msg id not found: alpine.LRH.2.00.1301152117460.13066@lnfm1.sai.msu.ru | Resolved by subject fallback

#15

Andres Freund

andres@anarazel.de

about 13 years ago

In reply to: Sergey Koposov (#14)

Re: Curious buildfarm failures (fwd)

On 2013-01-15 17:27:50 +0000, Sergey Koposov wrote:

Hi,

Date: Tue, 15 Jan 2013 11:57:07 -0500
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Andres Freund <andres@2ndquadrant.com>
Cc: math@sai.msu.ru, pgsql-hackers@postgreSQL.org,
Andrew Dunstan <andrew@dunslane.net>
Subject: Re: Curious buildfarm failures

Well, it could be quite reproducible, if for example what's happening is
that the DROP is failing to wait for the checkpointer at all.

Is there a way to enable log_checkpoints during a buildfarm run?
It'd be good to get timestamps added to the postmaster log entries, too.

Here is the log output from the failing pg_regress after enabling checkpoints and timestamps:

2013-01-15 21:20:19 MSK [50f58fd3.589e:1] LOG: database system was shut down at 2013-01-15 21:20:19 MS
K
2013-01-15 21:20:19 MSK [50f58fd3.58a2:1] LOG: autovacuum launcher started
2013-01-15 21:20:19 MSK [50f58fd3.5899:1] LOG: database system is ready to accept connections
2013-01-15 21:20:20 MSK [50f58fd3.589f:1] LOG: checkpoint starting: immediate force wait
2013-01-15 21:20:21 MSK [50f58fd3.589f:2] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 transact
ion log file(s) added, 0 removed, 0 recycled; write=0.604 s, sync=0.000 s, total=0.605 s; sync files=0,
longest=0.000 s, average=0.000 s
2013-01-15 21:20:21 MSK [50f58fd3.589f:3] LOG: checkpoint starting: immediate force wait
2013-01-15 21:20:21 MSK [50f58fd3.589f:4] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transact
ion log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.000 s; sync files=0,
longest=0.000 s, average=0.000 s
2013-01-15 21:20:21 MSK [50f58fd5.58ac:1] ERROR: unrecognized parameter "some_nonexistent_parameter"
2013-01-15 21:20:21 MSK [50f58fd5.58ac:2] STATEMENT: ALTER TABLESPACE testspace SET (some_nonexistent_
parameter = true);
2013-01-15 21:20:21 MSK [50f58fd5.58ac:3] ERROR: RESET must not include values for parameters
2013-01-15 21:20:21 MSK [50f58fd5.58ac:4] STATEMENT: ALTER TABLESPACE testspace RESET (random_page_cos
t = 2.0);
2013-01-15 21:20:21 MSK [50f58fd5.58ac:5] ERROR: duplicate key value violates unique constraint "anind
ex"
2013-01-15 21:20:21 MSK [50f58fd5.58ac:6] DETAIL: Key (column1)=(1) already exists.
2013-01-15 21:20:21 MSK [50f58fd5.58ac:7] STATEMENT: INSERT INTO testschema.atable VALUES(1);
2013-01-15 21:20:21 MSK [50f58fd5.58ac:8] ERROR: directory "/no/such/location" does not exist
2013-01-15 21:20:21 MSK [50f58fd5.58ac:9] STATEMENT: CREATE TABLESPACE badspace LOCATION '/no/such/loc
ation';
2013-01-15 21:20:21 MSK [50f58fd5.58ac:10] ERROR: tablespace "nosuchspace" does not exist
2013-01-15 21:20:21 MSK [50f58fd5.58ac:11] STATEMENT: CREATE TABLE bar (i int) TABLESPACE nosuchspace;
2013-01-15 21:20:21 MSK [50f58fd3.589f:5] LOG: checkpoint starting: immediate force wait
2013-01-15 21:20:21 MSK [50f58fd3.589f:6] LOG: checkpoint complete: wrote 37 buffers (0.2%); 0 transac
tion log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0
, longest=0.000 s, average=0.000 s
2013-01-15 21:20:21 MSK [50f58fd5.58ac:12] ERROR: tablespace "testspace" is not empty
2013-01-15 21:20:21 MSK [50f58fd5.58ac:13] STATEMENT: DROP TABLESPACE testspace;
2013-01-15 21:20:21 MSK [50f58fd3.589f:7] LOG: checkpoint starting: immediate force wait
2013-01-15 21:20:21 MSK [50f58fd3.589f:8] LOG: checkpoint complete: wrote 9 buffers (0.1%); 0 transact
ion log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0,
longest=0.000 s, average=0.000 s
2013-01-15 21:20:21 MSK [50f58fd5.58ac:14] ERROR: tablespace "testspace" is not empty
2013-01-15 21:20:21 MSK [50f58fd5.58ac:15] STATEMENT: DROP TABLESPACE testspace;

And I do see the tblspc file left after the finish of "make check":
tmp_check/data/pg_tblspc/16385/PG_9.3_201212081/16384/16387

Cheers,
S

PS I wouldn't be surprised that it is a compiler bug though. But I did see
the failure with newer icc as well.

Any chance you could run make check again but with log_statement=all and
log_min_messages=debug2? That might tell us a bit more, whether the
dependency code doesn't work right or whether the checkpoint is doing
strange things.

Thannks,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

Sergey Koposov

koposov@ast.cam.ac.uk

about 13 years ago

In reply to: Andres Freund (#15)

Re: Curious buildfarm failures (fwd)

On Tue, 15 Jan 2013, Andres Freund wrote:

Any chance you could run make check again but with log_statement=all and
log_min_messages=debug2? That might tell us a bit more, whether the
dependency code doesn't work right or whether the checkpoint is doing
strange things.

Here it is :

2013-01-15 23:06:18 MSK [50f5a8aa.1162:1] DEBUG: SlruScanDirectory invoking callback on pg_notify/0000
2013-01-15 23:06:18 MSK [50f5a8aa.1162:2] DEBUG: removing file "pg_notify/0000"
2013-01-15 23:06:18 MSK [50f5a8aa.1162:3] DEBUG: max_safe_fds = 985, usable_fds = 1000, already_open = 5
2013-01-15 23:06:18 MSK [50f5a8aa.1167:1] LOG: database system was shut down at 2013-01-15 23:06:18 MSK
2013-01-15 23:06:18 MSK [50f5a8aa.1167:2] DEBUG: checkpoint record is at 0/17700E0
2013-01-15 23:06:18 MSK [50f5a8aa.1167:3] DEBUG: redo record is at 0/17700E0; shutdown TRUE
2013-01-15 23:06:18 MSK [50f5a8aa.1167:4] DEBUG: next transaction ID: 0/686; next OID: 12031
2013-01-15 23:06:18 MSK [50f5a8aa.1167:5] DEBUG: next MultiXactId: 1; next MultiXactOffset: 0
2013-01-15 23:06:18 MSK [50f5a8aa.1167:6] DEBUG: oldest unfrozen transaction ID: 676, in database 1
2013-01-15 23:06:18 MSK [50f5a8aa.1167:7] DEBUG: transaction ID wrap limit is 2147484323, limited by database with OID 1
2013-01-15 23:06:18 MSK [50f5a8aa.1168:1] DEBUG: checkpointer updated shared memory configuration values
2013-01-15 23:06:18 MSK [50f5a8aa.116b:1] LOG: autovacuum launcher started
2013-01-15 23:06:18 MSK [50f5a8aa.1162:4] LOG: database system is ready to accept connections
2013-01-15 23:06:19 MSK [50f5a8aa.1162:5] DEBUG: forked new backend, pid=4463 socket=8
2013-01-15 23:06:19 MSK [50f5a8aa.1162:6] DEBUG: server process (PID 4463) exited with exit code 0
2013-01-15 23:06:19 MSK [50f5a8aa.1162:7] DEBUG: forked new backend, pid=4465 socket=8
2013-01-15 23:06:19 MSK [50f5a8ab.1171:1] LOG: statement: CREATE DATABASE "regression" TEMPLATE=template0
2013-01-15 23:06:19 MSK [50f5a8aa.1168:2] LOG: checkpoint starting: immediate force wait
2013-01-15 23:06:19 MSK [50f5a8aa.1168:3] DEBUG: SlruScanDirectory invoking callback on pg_multixact/offsets/0000
2013-01-15 23:06:19 MSK [50f5a8aa.1168:4] DEBUG: SlruScanDirectory invoking callback on pg_multixact/members/0000
2013-01-15 23:06:19 MSK [50f5a8aa.1168:5] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000
2013-01-15 23:06:19 MSK [50f5a8aa.1168:6] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000
2013-01-15 23:06:19 MSK [50f5a8aa.1168:7] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s
2013-01-15 23:06:19 MSK [50f5a8aa.1168:8] LOG: checkpoint starting: immediate force wait
2013-01-15 23:06:19 MSK [50f5a8aa.1168:9] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000
2013-01-15 23:06:19 MSK [50f5a8aa.1168:10] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000
2013-01-15 23:06:19 MSK [50f5a8aa.1168:11] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s
2013-01-15 23:06:19 MSK [50f5a8aa.1162:8] DEBUG: server process (PID 4465) exited with exit code 0
2013-01-15 23:06:19 MSK [50f5a8aa.1162:9] DEBUG: forked new backend, pid=4467 socket=8
2013-01-15 23:06:19 MSK [50f5a8ab.1173:1] LOG: statement: ALTER DATABASE "regression" SET lc_messages TO 'C';ALTER DATABASE "regression" SET lc_monetary TO 'C';ALTER DATABASE "regression" SET lc_numeric TO 'C';ALTER DATABASE "regression" SET lc_time TO 'C';ALTER DATABASE "regression" SET timezone_abbreviations TO 'Default';
2013-01-15 23:06:19 MSK [50f5a8aa.1162:10] DEBUG: server process (PID 4467) exited with exit code 0
2013-01-15 23:06:19 MSK [50f5a8aa.1162:11] DEBUG: forked new backend, pid=4469 socket=8
2013-01-15 23:06:19 MSK [50f5a8ab.1175:1] LOG: statement: CREATE TABLESPACE testspace LOCATION '/home/math/pg_git/src/test/regress/testtablespace';
2013-01-15 23:06:19 MSK [50f5a8ab.1175:2] LOG: statement: ALTER TABLESPACE testspace SET (random_page_cost = 1.0);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:3] LOG: statement: ALTER TABLESPACE testspace SET (some_nonexistent_parameter = true);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:4] ERROR: unrecognized parameter "some_nonexistent_parameter"
2013-01-15 23:06:19 MSK [50f5a8ab.1175:5] STATEMENT: ALTER TABLESPACE testspace SET (some_nonexistent_parameter = true);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:6] LOG: statement: ALTER TABLESPACE testspace RESET (random_page_cost = 2.0);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:7] ERROR: RESET must not include values for parameters
2013-01-15 23:06:19 MSK [50f5a8ab.1175:8] STATEMENT: ALTER TABLESPACE testspace RESET (random_page_cost = 2.0);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:9] LOG: statement: ALTER TABLESPACE testspace RESET (random_page_cost, seq_page_cost);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:10] LOG: statement: CREATE SCHEMA testschema;
2013-01-15 23:06:19 MSK [50f5a8ab.1175:11] LOG: statement: CREATE TABLE testschema.foo (i int) TABLESPACE testspace;
2013-01-15 23:06:19 MSK [50f5a8ab.1175:12] LOG: statement: SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_class c
where c.reltablespace = t.oid AND c.relname = 'foo';
2013-01-15 23:06:19 MSK [50f5a8ab.1175:13] LOG: statement: INSERT INTO testschema.foo VALUES(1);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:14] LOG: statement: INSERT INTO testschema.foo VALUES(2);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:15] LOG: statement: CREATE TABLE testschema.asselect TABLESPACE testspace AS SELECT 1;
2013-01-15 23:06:19 MSK [50f5a8ab.1175:16] LOG: statement: SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_class c
where c.reltablespace = t.oid AND c.relname = 'asselect';
2013-01-15 23:06:19 MSK [50f5a8ab.1175:17] LOG: statement: PREPARE selectsource(int) AS SELECT $1;
2013-01-15 23:06:19 MSK [50f5a8ab.1175:18] LOG: statement: CREATE TABLE testschema.asexecute TABLESPACE testspace
AS EXECUTE selectsource(2);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:19] LOG: statement: SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_class c
where c.reltablespace = t.oid AND c.relname = 'asexecute';
2013-01-15 23:06:19 MSK [50f5a8ab.1175:20] LOG: statement: CREATE INDEX foo_idx on testschema.foo(i) TABLESPACE testspace;
2013-01-15 23:06:19 MSK [50f5a8ab.1175:21] DEBUG: building index "foo_idx" on table "foo"
2013-01-15 23:06:19 MSK [50f5a8ab.1175:22] LOG: statement: SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_class c
where c.reltablespace = t.oid AND c.relname = 'foo_idx';
2013-01-15 23:06:19 MSK [50f5a8ab.1175:23] LOG: statement: CREATE TABLE testschema.atable AS VALUES (1), (2);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:24] LOG: statement: CREATE UNIQUE INDEX anindex ON testschema.atable(column1);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:25] DEBUG: building index "anindex" on table "atable"
2013-01-15 23:06:19 MSK [50f5a8ab.1175:26] LOG: statement: ALTER TABLE testschema.atable SET TABLESPACE testspace;
2013-01-15 23:06:19 MSK [50f5a8ab.1175:27] LOG: statement: ALTER INDEX testschema.anindex SET TABLESPACE testspace;
2013-01-15 23:06:19 MSK [50f5a8ab.1175:28] LOG: statement: INSERT INTO testschema.atable VALUES(3);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:29] LOG: statement: INSERT INTO testschema.atable VALUES(1);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:30] ERROR: duplicate key value violates unique constraint "anindex"
2013-01-15 23:06:19 MSK [50f5a8ab.1175:31] DETAIL: Key (column1)=(1) already exists.
2013-01-15 23:06:19 MSK [50f5a8ab.1175:32] STATEMENT: INSERT INTO testschema.atable VALUES(1);
2013-01-15 23:06:19 MSK [50f5a8ab.1175:33] LOG: statement: SELECT COUNT(*) FROM testschema.atable;
2013-01-15 23:06:19 MSK [50f5a8ab.1175:34] LOG: statement: CREATE TABLESPACE badspace LOCATION '/no/such/location';
2013-01-15 23:06:19 MSK [50f5a8ab.1175:35] ERROR: directory "/no/such/location" does not exist
2013-01-15 23:06:19 MSK [50f5a8ab.1175:36] STATEMENT: CREATE TABLESPACE badspace LOCATION '/no/such/location';
2013-01-15 23:06:19 MSK [50f5a8ab.1175:37] LOG: statement: CREATE TABLE bar (i int) TABLESPACE nosuchspace;
2013-01-15 23:06:19 MSK [50f5a8ab.1175:38] ERROR: tablespace "nosuchspace" does not exist
2013-01-15 23:06:19 MSK [50f5a8ab.1175:39] STATEMENT: CREATE TABLE bar (i int) TABLESPACE nosuchspace;
2013-01-15 23:06:19 MSK [50f5a8ab.1175:40] LOG: statement: DROP TABLESPACE testspace;
2013-01-15 23:06:19 MSK [50f5a8aa.1168:12] LOG: checkpoint starting: immediate force wait
2013-01-15 23:06:19 MSK [50f5a8aa.1168:13] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000
2013-01-15 23:06:19 MSK [50f5a8aa.1168:14] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000
2013-01-15 23:06:19 MSK [50f5a8aa.1168:15] LOG: checkpoint complete: wrote 37 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s
2013-01-15 23:06:19 MSK [50f5a8ab.1175:41] ERROR: tablespace "testspace" is not empty
2013-01-15 23:06:19 MSK [50f5a8ab.1175:42] STATEMENT: DROP TABLESPACE testspace;
2013-01-15 23:06:19 MSK [50f5a8ab.1175:43] LOG: statement: DROP SCHEMA testschema CASCADE;
2013-01-15 23:06:19 MSK [50f5a8ab.1175:44] DEBUG: drop auto-cascades to index testschema.foo_idx
2013-01-15 23:06:19 MSK [50f5a8ab.1175:45] DEBUG: drop auto-cascades to type testschema.foo
2013-01-15 23:06:19 MSK [50f5a8ab.1175:46] DEBUG: drop auto-cascades to type testschema.foo[]
2013-01-15 23:06:19 MSK [50f5a8ab.1175:47] DEBUG: drop auto-cascades to type testschema.asselect
2013-01-15 23:06:19 MSK [50f5a8ab.1175:48] DEBUG: drop auto-cascades to type testschema.asselect[]
2013-01-15 23:06:19 MSK [50f5a8ab.1175:49] DEBUG: drop auto-cascades to type testschema.asexecute
2013-01-15 23:06:19 MSK [50f5a8ab.1175:50] DEBUG: drop auto-cascades to type testschema.asexecute[]
2013-01-15 23:06:19 MSK [50f5a8ab.1175:51] DEBUG: drop auto-cascades to index testschema.anindex
2013-01-15 23:06:19 MSK [50f5a8ab.1175:52] DEBUG: drop auto-cascades to type testschema.atable
2013-01-15 23:06:19 MSK [50f5a8ab.1175:53] DEBUG: drop auto-cascades to type testschema.atable[]
2013-01-15 23:06:19 MSK [50f5a8ab.1175:54] NOTICE: drop cascades to 4 other objects
2013-01-15 23:06:19 MSK [50f5a8ab.1175:55] DETAIL: drop cascades to table testschema.foo
drop cascades to table testschema.asselect
drop cascades to table testschema.asexecute
drop cascades to table testschema.atable
2013-01-15 23:06:19 MSK [50f5a8ab.1175:56] LOG: statement: DROP TABLESPACE testspace;
2013-01-15 23:06:19 MSK [50f5a8aa.1168:16] LOG: checkpoint starting: immediate force wait
2013-01-15 23:06:19 MSK [50f5a8aa.1168:17] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000
2013-01-15 23:06:19 MSK [50f5a8aa.1168:18] DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000
2013-01-15 23:06:19 MSK [50f5a8aa.1168:19] LOG: checkpoint complete: wrote 9 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.000 s; sync files=0, longest=0.000 s, average=0.000 s
2013-01-15 23:06:19 MSK [50f5a8ab.1175:57] ERROR: tablespace "testspace" is not empty
2013-01-15 23:06:19 MSK [50f5a8ab.1175:58] STATEMENT: DROP TABLESPACE testspace;
2013-01-15 23:06:19 MSK [50f5a8aa.1162:12] DEBUG: server process (PID 4469) exited with exit code 0

Cheers,
S
*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

Tom Lane

tgl@sss.pgh.pa.us

about 13 years ago

In reply to: Sergey Koposov (#14)

Re: Curious buildfarm failures (fwd)

Sergey Koposov <koposov@ast.cam.ac.uk> writes:

And I do see the tblspc file left after the finish of "make check":
tmp_check/data/pg_tblspc/16385/PG_9.3_201212081/16384/16387

Interesting. If the tests are run immediately after initdb, 16387
is the relfilenode assigned to table "foo" in the tablespace regression
test. But why would only that table be left behind? There are half
a dozen relations in that tablespace at the point of the DROP CASCADE.

BTW, I just finished trying to reproduce this on an IA64 machine
belonging to Red Hat, without success. So that seems to eliminate
any possibility of the machine architecture being the trigger issue.
The compiler's still a likely cause though.

Anybody have a similar ICC version (dugong's says it is 10.0 20070809)
to try? Also, Sergey, can you find a non-dot-zero release to try?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Sergey Koposov

koposov@ast.cam.ac.uk

about 13 years ago

In reply to: Tom Lane (#17)

Re: Curious buildfarm failures (fwd)

On Tue, 15 Jan 2013, Tom Lane wrote:

BTW, I just finished trying to reproduce this on an IA64 machine
belonging to Red Hat, without success. So that seems to eliminate
any possibility of the machine architecture being the trigger issue.
The compiler's still a likely cause though.
Anybody have a similar ICC version (dugong's says it is 10.0 20070809)
to try? Also, Sergey, can you find a non-dot-zero release to try?

I think it is indeed the main issue.
I've tried 10.1 ( 10.1.011 ) and it doesn't fail.

When I tried 11.1 (icc (ICC) 11.1 20100401 ) it failed in a quite strange
way (I don't quite remember it happenning before)

test tablespace ... ok
parallel group (18 tests): txid int2 text name oid varchar int4 char money
float8 uuid float4 int8 boolean bit enum numeric rangetypes
boolean ... ok
char ... ok
name ... ok
varchar ... ok
text ... ok
int2 ... ok
int4 ... ok
int8 ... ok
oid ... ok
float4 ... ok
float8 ... ok
bit ... ok
numeric ... ok
txid ... ok
uuid ... ok
enum ... ok
money ... ok
rangetypes ... FAILED (test process exited with exit code
2)
test strings ... FAILED (test process exited with exit code
2)
test numerology ... FAILED (test process exited with exit code
2)
parallel group (19 tests): path interval time inet circle macaddr comments
timestamp timestamptz reltime date tstypes tinterval abstime timetz lseg box
polygon point
point ... FAILED (test process exited with exit code
2)
lseg ... FAILED (test process exited with exit code
2)
box ... FAILED (test process exited with exit code
2)
path ... FAILED (test process exited with exit code
2)
polygon ... FAILED (test process exited with exit code
2)
circle ... FAILED (test process exited with exit code
2)
date ... FAILED (test process exited with exit code
2)
time ... FAILED (test process exited with exit code
2)
timetz ... FAILED (test process exited with exit code
2)
timestamp ... FAILED (test process exited with exit code
2)
timestamptz ... FAILED (test process exited with exit code
2)
interval ... FAILED (test process exited with exit code
2)
abstime ... FAILED (test process exited with exit code
2)
reltime ... FAILED (test process exited with exit code
2)
tinterval ... FAILED (test process exited with exit code
2)
inet ... FAILED (test process exited with exit code
2)
macaddr ... FAILED (test process exited with exit code
2)
tstypes ... FAILED (test process exited with exit code
2)
comments ... FAILED (test process exited with exit code
2)
parallel group (6 tests): geometry regex horology type_sanity oidjoins
opr_sanity
geometry ... FAILED
horology ... FAILED
regex ... ok
oidjoins ... ok
type_sanity ... ok
opr_sanity ... ok
test insert ... ok
test create_function_1 ... ok
test create_type ... ok
test create_table ... ok
test create_function_2 ... ok
parallel group (2 tests): copyselect copy
copy ... ok
copyselect ... ok
parallel group (2 tests): create_operator create_misc
create_misc ... ok
create_operator ... ok
parallel group (2 tests): create_view create_index
create_index ... FAILED (test process exited with exit code
2)
create_view ... ok
parallel group (11 tests): constraints triggers create_cast
create_function_3 updatable_views inherit drop_if_exists create_aggregate
create_table_like typed_table vacuum
create_aggregate ... FAILED (test process exited with exit code
2)
create_function_3 ... FAILED (test process exited with exit code
2)
create_cast ... FAILED (test process exited with exit code
2)
constraints ... FAILED (test process exited with exit code
2)
triggers ... FAILED (test process exited with exit code
2)
inherit ... FAILED (test process exited with exit code
2)
create_table_like ... FAILED (test process exited with exit code
2)
typed_table ... FAILED (test process exited with exit code
2)
vacuum ... FAILED (test process exited with exit code
2)
drop_if_exists ... FAILED (test process exited with exit code
2)
updatable_views ... FAILED (test process exited with exit code
2)
test sanity_check ... FAILED (test process exited with exit code
2)
test errors ... FAILED (test process exited with exit code
2)
test select ... FAILED (test process exited with exit code
2)
parallel group (20 tests): join transactions subselect portals
select_having case select_distinct btree_index union hash_index namespace
random delete select_implicit select_into arrays update prepared_xacts
aggregates select_distinct_on
select_into ... FAILED (test process exited with exit code
2)
select_distinct ... FAILED (test process exited with exit code
2)
select_distinct_on ... FAILED (test process exited with exit code
2)
select_implicit ... FAILED (test process exited with exit code
2)
select_having ... FAILED (test process exited with exit code
2)
subselect ... FAILED (test process exited with exit code
2)
union ... FAILED (test process exited with exit code
2)
case ... FAILED (test process exited with exit code
2)
join ... FAILED (test process exited with exit code
2)
aggregates ... FAILED (test process exited with exit code
2)
transactions ... FAILED (test process exited with exit code
2)
random ... failed (ignored) (test process exited with
exit code 2)
portals ... FAILED (test process exited with exit code
2)
arrays ... FAILED (test process exited with exit code
2)
btree_index ... FAILED (test process exited with exit code
2)
hash_index ... FAILED (test process exited with exit code
2)
update ... FAILED (test process exited with exit code
2)
namespace ... FAILED (test process exited with exit code
2)
prepared_xacts ... FAILED (test process exited with exit code
2)
delete ... FAILED (test process exited with exit code
2)
parallel group (3 tests): security_label collate privileges
privileges ... ok
security_label ... ok
collate ... ok
parallel group (2 tests): alter_generic misc
misc ... FAILED
alter_generic ... ok
test rules ... ok
test event_trigger ... ok
parallel group (16 tests): combocid portals_p2 advisory_lock xmlmap json
tsdicts guc functional_deps dependency tsearch cluster select_views window
foreign_data foreign_key bitmapops
select_views ... ok
portals_p2 ... ok
foreign_key ... ok
cluster ... ok
dependency ... ok
guc ... ok
bitmapops ... ok
combocid ... ok
tsearch ... ok
tsdicts ... ok
foreign_data ... ok
window ... ok
xmlmap ... ok
functional_deps ... ok
advisory_lock ... ok
json ... ok
parallel group (19 tests): limit conversion prepare temp returning copy2
plancache sequence xml largeobject rowtypes polymorphism without_oid domain
rangefuncs with truncate alter_table plpgsql
plancache ... ok
limit ... ok
plpgsql ... ok
copy2 ... ok
temp ... ok
domain ... ok
rangefuncs ... ok
prepare ... ok
without_oid ... ok
conversion ... ok
truncate ... ok
alter_table ... ok
sequence ... ok
polymorphism ... ok
rowtypes ... ok
returning ... ok
largeobject ... ok
with ... FAILED
xml ... ok
test stats ... ok
#########

Digging into the log showed the following trap:

TRAP: FailedAssertion("!(( ((void) ((bool) ((! assert_enabled) || ! (!(((bool) (((const void*)(&nodes[n
]->t_tid) != ((void *)0)) && ((&nodes[n]->t_tid)->ip_posid != 0))))) || (ExceptionalCondition("!(((bool
) (((const void*)(&nodes[n]->t_tid) != ((void *)0)) && ((&nodes[n]->t_tid)->ip_posid != 0))))", ("Faile
dAssertion"), "spgdoinsert.c", 1222), 0)))), ( ((void) ((bool) ((! assert_enabled) || ! (!(((bool) ((co
nst void*)(&(&nodes[n]->t_tid)->ip_blkid) != ((void *)0))))) || (ExceptionalCondition("!(((bool) ((cons
t void*)(&(&nodes[n]->t_tid)->ip_blkid) != ((void *)0))))", ("FailedAssertion"), "spgdoinsert.c", 1222)
, 0)))), (BlockNumber) (((&(&nodes[n]->t_tid)->ip_blkid)->bi_hi << 16) | ((uint16) (&(&nodes[n]->t_tid)
->ip_blkid)->bi_lo)) ) ) == leafBlock)", File: "spgdoinsert.c", Line: 1222)
2013-01-16 00:23:39 MSK [50f5bac9.3d6e:48] DEBUG: server process (PID 15774) was terminated by signal
6: Aborted
2013-01-16 00:23:39 MSK [50f5bac9.3d6e:49] DETAIL: Failed process was running: insert into test_range_
spgist select 'empty'::int4range from generate_series(1,500) g;
2013-01-16 00:23:39 MSK [50f5bac9.3d6e:50] LOG: server process (PID 15774) was terminated by signal 6:
Aborted

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Andres Freund

andres@anarazel.de

about 13 years ago

In reply to: Tom Lane (#17)

Re: Curious buildfarm failures (fwd)

On 2013-01-15 14:40:11 -0500, Tom Lane wrote:

Sergey Koposov <koposov@ast.cam.ac.uk> writes:

And I do see the tblspc file left after the finish of "make check":
tmp_check/data/pg_tblspc/16385/PG_9.3_201212081/16384/16387

Interesting. If the tests are run immediately after initdb, 16387
is the relfilenode assigned to table "foo" in the tablespace regression
test. But why would only that table be left behind? There are half
a dozen relations in that tablespace at the point of the DROP CASCADE.

BTW, I just finished trying to reproduce this on an IA64 machine
belonging to Red Hat, without success. So that seems to eliminate
any possibility of the machine architecture being the trigger issue.
The compiler's still a likely cause though.

Anybody have a similar ICC version (dugong's says it is 10.0 20070809)
to try? Also, Sergey, can you find a non-dot-zero release to try?

I played a bit arround (thanks Sergey!) and it seems to be some rather
strange optimization issue around the fsync request queue.

Namely changing

/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
request->rnode = rnode;
request->forknum = forknum;
request->segno = segno;
into
/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
request->rnode.spcNode = rnode.spcNode;
request->rnode.dbNode = rnode.dbNode;
request->rnode.relNode = rnode.relNode;
request->forknum = forknum;
request->segno = segno;
makes it pass reliably.

Displaying the values of request after the assignment, but without the
change shows bogus values showing up which explains the problems.

How the hell thats correlating with the elog changes I don't yet know.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20

Tom Lane

tgl@sss.pgh.pa.us

about 13 years ago

In reply to: Andres Freund (#19)

Re: Curious buildfarm failures (fwd)

Andres Freund <andres@2ndquadrant.com> writes:

I played a bit arround (thanks Sergey!) and it seems to be some rather
strange optimization issue around the fsync request queue.

Namely changing
request->rnode = rnode;
into
request->rnode.spcNode = rnode.spcNode;
request->rnode.dbNode = rnode.dbNode;
request->rnode.relNode = rnode.relNode;
makes it pass reliably.

Jeez. That's my candidate for weird compiler bug of the month.

How the hell thats correlating with the elog changes I don't yet know.

There is an elog(ERROR) further up in the same function, but it's sure
not clear how that could cause the compiler to misimplement a struct
assignment. Maybe the problem is not in those lines alone, but the fact
that rnode is a pass-by-value struct? (That is, maybe it's the value of
the rnode local variable that's getting munged, somewhere up near the
elog call?) We tend to not use pass-by-value struct params much, so we
might not have noticed a compiler bug associated with that. Or IOW,
does changing ForwardFsyncRequest to use a "const RelFileNode *rnode"
parameter make it go away?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

Andres Freund

andres@anarazel.de

about 13 years ago

In reply to: Tom Lane (#20)

#22

Andres Freund

andres@anarazel.de

about 13 years ago

In reply to: Andres Freund (#21)

#23

Tom Lane