assertion failure 9.3.4

Started by Andrew Dunstanalmost 12 years ago43 messageshackers
Jump to latest
#1Andrew Dunstan
andrew@dunslane.net

With a client's code I have just managed to produce the following
assertion failure on 9.3.4:

2014-04-15 01:02:46 GMT [19854] 76299: LOG: execute <unnamed>:
select * from "asp_ins_event_task_log"( job_id:=1, event_id:=3164,
task_name:='EventUtcComputeTask', task_status_code:='VALID'
, task_start_utc:='04/15/2014 01:02:44.563',
task_end_utc:='04/15/2014 01:02:44.563')
TRAP: FailedAssertion("!(update_xact == ((TransactionId) 0))", File:
"/home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c", Line: 5873)
2014-04-15 01:02:46 GMT [11959] 11: LOG: server process (PID 19854)
was terminated by signal 6: Aborted
2014-04-15 01:02:46 GMT [11959] 12: DETAIL: Failed process was
running: select * from "asp_ins_event_task_log"( job_id:=1,
event_id:=3164, task_name:='EventUtcComputeTask', task_status_code
:='VALID', task_start_utc:='04/15/2014 01:02:44.563',
task_end_utc:='04/15/2014 01:02:44.563')
2014-04-15 01:02:46 GMT [11959] 13: LOG: terminating any other
active server processes

When running without assertions, the client reports experiencing tables
with duplicate primary keys among other things. It's apparently quite
reproducible.

I'm digging into this, but it's a nasty bug and any useful thoughts
would be appreciated.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#1)
Re: assertion failure 9.3.4

On 04/14/2014 09:28 PM, Andrew Dunstan wrote:

With a client's code I have just managed to produce the following
assertion failure on 9.3.4:

2014-04-15 01:02:46 GMT [19854] 76299: LOG: execute <unnamed>:
select * from "asp_ins_event_task_log"( job_id:=1, event_id:=3164,
task_name:='EventUtcComputeTask', task_status_code:='VALID'
, task_start_utc:='04/15/2014 01:02:44.563',
task_end_utc:='04/15/2014 01:02:44.563')
TRAP: FailedAssertion("!(update_xact == ((TransactionId) 0))", File:
"/home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c", Line:
5873)
2014-04-15 01:02:46 GMT [11959] 11: LOG: server process (PID 19854)
was terminated by signal 6: Aborted
2014-04-15 01:02:46 GMT [11959] 12: DETAIL: Failed process was
running: select * from "asp_ins_event_task_log"( job_id:=1,
event_id:=3164, task_name:='EventUtcComputeTask', task_status_code
:='VALID', task_start_utc:='04/15/2014 01:02:44.563',
task_end_utc:='04/15/2014 01:02:44.563')
2014-04-15 01:02:46 GMT [11959] 13: LOG: terminating any other
active server processes

When running without assertions, the client reports experiencing
tables with duplicate primary keys among other things. It's apparently
quite reproducible.

I'm digging into this, but it's a nasty bug and any useful thoughts
would be appreciated.

and here the stack trace:

#0 0x000000361ba36285 in __GI_raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x000000361ba37b9b in __GI_abort () at abort.c:91
#2 0x000000000075c157 in ExceptionalCondition
(conditionName=<optimized out>, errorType=<optimized out>,
fileName=<optimized out>, lineNumber=<optimized out>)
at /home/andrew/pgl/pg_9_3/src/backend/utils/error/assert.c:54
#3 0x000000000048c2af in MultiXactIdGetUpdateXid (xmax=<optimized
out>, t_infomask=<optimized out>) at
/home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:5873
#4 0x000000000078ad50 in HeapTupleSatisfiesMVCC
(tuple=0x7feb3be8a790, snapshot=0x1025d70, buffer=2172) at
/home/andrew/pgl/pg_9_3/src/backend/utils/time/tqual.c:1221
#5 0x000000000048aad2 in heapgetpage (scan=0x11522f0, page=6) at
/home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:397
#6 0x000000000048aee9 in heapgettup_pagemode (scan=0x11522f0,
dir=<optimized out>, nkeys=0, key=0x0) at
/home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:939
#7 0x000000000048d646 in heap_getnext (scan=0x11522f0,
direction=<optimized out>) at
/home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:1459
#8 0x00000000005becab in SeqNext (node=<optimized out>) at
/home/andrew/pgl/pg_9_3/src/backend/executor/nodeSeqscan.c:66
#9 0x00000000005ac66e in ExecScanFetch (recheckMtd=0x5bec70
<SeqRecheck>, accessMtd=0x5bec80 <SeqNext>, node=0x1151488) at
/home/andrew/pgl/pg_9_3/src/backend/executor/execScan.c:82
#10 ExecScan (node=0x1151488, accessMtd=0x5bec80 <SeqNext>,
recheckMtd=0x5bec70 <SeqRecheck>) at
/home/andrew/pgl/pg_9_3/src/backend/executor/execScan.c:167
#11 0x00000000005a5338 in ExecProcNode (node=0x1151488) at
/home/andrew/pgl/pg_9_3/src/backend/executor/execProcnode.c:400
#12 0x00000000005b9bcf in ExecLockRows (node=0x1151278) at
/home/andrew/pgl/pg_9_3/src/backend/executor/nodeLockRows.c:56
#13 0x00000000005a51d0 in ExecProcNode (node=0x1151278) at
/home/andrew/pgl/pg_9_3/src/backend/executor/execProcnode.c:496
#14 0x00000000005a250a in ExecutePlan (dest=0xb952e0,
direction=<optimized out>, numberTuples=1, sendTuples=1 '\001',
operation=CMD_SELECT, planstate=0x1151278, estate=0x1151038)
at /home/andrew/pgl/pg_9_3/src/backend/executor/execMain.c:1473
#15 standard_ExecutorRun (queryDesc=0x114f230, direction=<optimized
out>, count=1) at
/home/andrew/pgl/pg_9_3/src/backend/executor/execMain.c:307
#16 0x00007feb5cf2f27d in explain_ExecutorRun (queryDesc=0x114f230,
direction=ForwardScanDirection, count=1) at
/home/andrew/pgl/pg_9_3/contrib/auto_explain/auto_explain.c:233
#17 0x00007feb5cd2a235 in pgss_ExecutorRun (queryDesc=0x114f230,
direction=ForwardScanDirection, count=1) at
/home/andrew/pgl/pg_9_3/contrib/pg_stat_statements/pg_stat_statements.c:717
#18 0x00000000005c8b97 in _SPI_pquery (tcount=1, fire_triggers=0
'\000', queryDesc=<optimized out>) at
/home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:2369
#19 _SPI_execute_plan (plan=0x10afac8, paramLI=0x114f028,
snapshot=0x0, crosscheck_snapshot=0x0, read_only=0 '\000',
fire_triggers=0 '\000', tcount=1)
at /home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:2157
#20 0x00000000005c8fd1 in SPI_execute_snapshot (plan=0x10afac8,
Values=0x7fff147d8500, Nulls=0x7fff147d8700 " ", snapshot=0x0,
crosscheck_snapshot=0x0, read_only=0 '\000',
fire_triggers=0 '\000', tcount=1) at
/home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:488
#21 0x0000000000722635 in ri_PerformCheck (riinfo=0xf98168,
qkey=0x7fff147d8a20, qplan=0x10afac8, fk_rel=0x7feb39048048,
pk_rel=0x7feb38ff9b40, old_tuple=0x0, new_tuple=0x7fff147d9010,
detectNewRows=0 '\000', expect_OK=5) at
/home/andrew/pgl/pg_9_3/src/backend/utils/adt/ri_triggers.c:3094
#22 0x0000000000722cfc in RI_FKey_check (trigdata=<optimized out>)
at /home/andrew/pgl/pg_9_3/src/backend/utils/adt/ri_triggers.c:440
#23 0x0000000000588add in ExecCallTriggerFunc
(trigdata=0x7fff147d8fe0, tgindx=4, finfo=0x11409a8, instr=0x0,
per_tuple_context=0x1198d28)
at /home/andrew/pgl/pg_9_3/src/backend/commands/trigger.c:1871
#24 0x0000000000588f6b in AfterTriggerExecute
(per_tuple_context=0x1198d28, instr=0x0, finfo=0x11408e8,
trigdesc=0x11401f8, rel=0x7feb39048048, event=0x114d038)
at /home/andrew/pgl/pg_9_3/src/backend/commands/trigger.c:3454
#25 afterTriggerInvokeEvents (events=0x1025620, firing_id=1,
estate=0x113ffb8, delete_ok=1 '\001') at
/home/andrew/pgl/pg_9_3/src/backend/commands/trigger.c:3633
#26 0x000000000058e873 in AfterTriggerEndQuery (estate=0x113ffb8) at
/home/andrew/pgl/pg_9_3/src/backend/commands/trigger.c:3823
#27 0x00000000005a26f4 in standard_ExecutorFinish
(queryDesc=0x1076d08) at
/home/andrew/pgl/pg_9_3/src/backend/executor/execMain.c:379
#28 0x00007feb5cf2f1a5 in explain_ExecutorFinish
(queryDesc=0x1076d08) at
/home/andrew/pgl/pg_9_3/contrib/auto_explain/auto_explain.c:256
#29 0x00007feb5cd2a163 in pgss_ExecutorFinish (queryDesc=0x1076d08)
at
/home/andrew/pgl/pg_9_3/contrib/pg_stat_statements/pg_stat_statements.c:740
#30 0x00000000005c8be4 in _SPI_pquery (tcount=2, fire_triggers=1
'\001', queryDesc=<optimized out>) at
/home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:2381
#31 _SPI_execute_plan (plan=0x10905a8, paramLI=0x112ab10,
snapshot=0x0, crosscheck_snapshot=0x0, read_only=0 '\000',
fire_triggers=1 '\001', tcount=2)
at /home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:2157
#32 0x00000000005c909f in SPI_execute_plan_with_paramlist
(plan=0x10905a8, params=0x112ab10, read_only=0 '\000', tcount=2) at
/home/andrew/pgl/pg_9_3/src/backend/executor/spi.c:449
#33 0x00007feb3908d877 in exec_stmt_execsql (estate=0x7fff147d9700,
stmt=0x10630f0) at
/home/andrew/pgl/pg_9_3/src/pl/plpgsql/src/pl_exec.c:3293
#34 0x00007feb3908ee9e in exec_stmt (stmt=0x10630f0,
estate=0x7fff147d9700) at
/home/andrew/pgl/pg_9_3/src/pl/plpgsql/src/pl_exec.c:1446
#35 exec_stmts (estate=0x7fff147d9700, stmts=<optimized out>) at
/home/andrew/pgl/pg_9_3/src/pl/plpgsql/src/pl_exec.c:1345
#36 0x00007feb3909157b in exec_stmt_block (estate=0x7fff147d9700,
block=0x1063c68) at
/home/andrew/pgl/pg_9_3/src/pl/plpgsql/src/pl_exec.c:1283
#37 0x00007feb390921dd in plpgsql_exec_function (func=0x106bbe8,
fcinfo=0x7fff147d9940) at
/home/andrew/pgl/pg_9_3/src/pl/plpgsql/src/pl_exec.c:321
#38 0x00007feb39086ff4 in plpgsql_call_handler
(fcinfo=0x7fff147d9940) at
/home/andrew/pgl/pg_9_3/src/pl/plpgsql/src/pl_handler.c:129
#39 0x00000000005aa632 in ExecMakeTableFunctionResult
(funcexpr=0x1122070, econtext=0x1121b98, expectedDesc=0x1121f18,
randomAccess=0 '\000')
at /home/andrew/pgl/pg_9_3/src/backend/executor/execQual.c:2164
#40 0x00000000005be0c1 in FunctionNext (node=0x1121a80) at
/home/andrew/pgl/pg_9_3/src/backend/executor/nodeFunctionscan.c:64
#41 0x00000000005ac74e in ExecScanFetch (recheckMtd=0x5be040
<FunctionRecheck>, accessMtd=0x5be050 <FunctionNext>, node=0x1121a80)
at /home/andrew/pgl/pg_9_3/src/backend/executor/execScan.c:82
#42 ExecScan (node=0x1121a80, accessMtd=0x5be050 <FunctionNext>,
recheckMtd=0x5be040 <FunctionRecheck>) at
/home/andrew/pgl/pg_9_3/src/backend/executor/execScan.c:132
#43 0x00000000005a52d8 in ExecProcNode (node=0x1121a80) at
/home/andrew/pgl/pg_9_3/src/backend/executor/execProcnode.c:426
#44 0x00000000005a250a in ExecutePlan (dest=0xeeaa40,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, planstate=0x1121a80, estate=0x1121968)
at /home/andrew/pgl/pg_9_3/src/backend/executor/execMain.c:1473
#45 standard_ExecutorRun (queryDesc=0x1029e60, direction=<optimized
out>, count=0) at
/home/andrew/pgl/pg_9_3/src/backend/executor/execMain.c:307
#46 0x00007feb5cf2f27d in explain_ExecutorRun (queryDesc=0x1029e60,
direction=ForwardScanDirection, count=0) at
/home/andrew/pgl/pg_9_3/contrib/auto_explain/auto_explain.c:233
#47 0x00007feb5cd2a235 in pgss_ExecutorRun (queryDesc=0x1029e60,
direction=ForwardScanDirection, count=0) at
/home/andrew/pgl/pg_9_3/contrib/pg_stat_statements/pg_stat_statements.c:717
#48 0x000000000068fe77 in PortalRunSelect (portal=0xf2d598,
forward=<optimized out>, count=0, dest=<optimized out>) at
/home/andrew/pgl/pg_9_3/src/backend/tcop/pquery.c:946
#49 0x00000000006913c1 in PortalRun (portal=0xf2d598,
count=9223372036854775807, isTopLevel=1 '\001', dest=0xeeaa40,
altdest=0xeeaa40, completionTag=0x7fff147da470 "")
at /home/andrew/pgl/pg_9_3/src/backend/tcop/pquery.c:790
#50 0x000000000068ec50 in exec_execute_message (max_rows=<optimized
out>, portal_name=0xeea628 "") at
/home/andrew/pgl/pg_9_3/src/backend/tcop/postgres.c:1931
#51 PostgresMain (argc=<optimized out>, argv=<optimized out>,
dbname=0xe3e0c8 "eventjobs", username=<optimized out>) at
/home/andrew/pgl/pg_9_3/src/backend/tcop/postgres.c:4060
#52 0x000000000064465c in BackendRun (port=0xe63490) at
/home/andrew/pgl/pg_9_3/src/backend/postmaster/postmaster.c:4011
#53 BackendStartup (port=0xe63490) at
/home/andrew/pgl/pg_9_3/src/backend/postmaster/postmaster.c:3685
#54 ServerLoop () at
/home/andrew/pgl/pg_9_3/src/backend/postmaster/postmaster.c:1586
#55 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
/home/andrew/pgl/pg_9_3/src/backend/postmaster/postmaster.c:1253
#56 0x000000000045c897 in main (argc=3, argv=0xe3d270) at
/home/andrew/pgl/pg_9_3/src/backend/main/main.c:226

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Dunstan (#2)
Re: assertion failure 9.3.4

Andrew Dunstan wrote:

and here the stack trace:

#0 0x000000361ba36285 in __GI_raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x000000361ba37b9b in __GI_abort () at abort.c:91
#2 0x000000000075c157 in ExceptionalCondition
(conditionName=<optimized out>, errorType=<optimized out>,
fileName=<optimized out>, lineNumber=<optimized out>)
at /home/andrew/pgl/pg_9_3/src/backend/utils/error/assert.c:54
#3 0x000000000048c2af in MultiXactIdGetUpdateXid (xmax=<optimized
out>, t_infomask=<optimized out>) at
/home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:5873
#4 0x000000000078ad50 in HeapTupleSatisfiesMVCC
(tuple=0x7feb3be8a790, snapshot=0x1025d70, buffer=2172) at
/home/andrew/pgl/pg_9_3/src/backend/utils/time/tqual.c:1221

Clearly this is a bug related to multixacts and the related tqual.c
changse. Will look.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Alvaro Herrera (#3)
Re: assertion failure 9.3.4

On 04/14/2014 10:02 PM, Alvaro Herrera wrote:

Andrew Dunstan wrote:

and here the stack trace:

#0 0x000000361ba36285 in __GI_raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x000000361ba37b9b in __GI_abort () at abort.c:91
#2 0x000000000075c157 in ExceptionalCondition
(conditionName=<optimized out>, errorType=<optimized out>,
fileName=<optimized out>, lineNumber=<optimized out>)
at /home/andrew/pgl/pg_9_3/src/backend/utils/error/assert.c:54
#3 0x000000000048c2af in MultiXactIdGetUpdateXid (xmax=<optimized
out>, t_infomask=<optimized out>) at
/home/andrew/pgl/pg_9_3/src/backend/access/heap/heapam.c:5873
#4 0x000000000078ad50 in HeapTupleSatisfiesMVCC
(tuple=0x7feb3be8a790, snapshot=0x1025d70, buffer=2172) at
/home/andrew/pgl/pg_9_3/src/backend/utils/time/tqual.c:1221

Clearly this is a bug related to multixacts and the related tqual.c
changse. Will look.

OK. If you need me to test patches I can do so.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Dunstan (#2)
Re: assertion failure 9.3.4

So, from top to bottom I see the following elements:

* backend is executing a query
* this query is getting captured by pg_stat_statements
* the query is also getting captured by autoexplain, in chain from
pg_stat_statements
* autoexplain runs the query, which invokes a plpgsql function
* this plpgsql function runs another query, and this one is captured by
pg_stat_statements
* and also by autoexplain
* The autoexplain run of this inner query invokes a trigger
* .. which is a FK trigger, and therefore runs ri_PerformCheck which
runs another query
* This FK check query gets captured by pg_stat_statements
* and also by autoexplain, which runs it
* this autoexplain run of the third query invokes SeqNext
* this does a heap access, which uses HeapTupleSatisfiesMVCC
* this one tries to read the update XID and gets an assertion failure.

Oh my.

Would it be possible for you to provide a self-contained setup that
behaves like this? I think a "bt full" would be useful since it'd
provide the queries at each of the three stages.

I'm not quite clear on why the third query, the one in ri_PerformCheck,
is invoking a sequence. AFAICS it's this:

/* ----------
* The query string built is
* SELECT 1 FROM ONLY <pktable> x WHERE pkatt1 = $1 [AND ...]
* FOR KEY SHARE OF x
* The type id's for the $ parameters are those of the
* corresponding FK attributes.
* ----------
*/
so either I'm misreading the whole thing, or there is something very odd
going on here.

Are you aware of something unusual in the FK setups?

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#5)
Re: assertion failure 9.3.4

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I'm not quite clear on why the third query, the one in ri_PerformCheck,
is invoking a sequence.

It's not --- SeqNext is the next-tuple function for a sequential scan.
Nothing to do with sequences.

Now, it *is* worth wondering why the heck a query on the table's primary
key is using a seqscan and not an indexscan. Maybe the planner thinks
there are just a few rows in the table? But the stack trace seems
unexceptional other than that.

I'm wondering if the combination of autoexplain and pg_stat_statements
is causing trouble.

Yeah, it would be real nice to see a self-contained test case for this.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#6)
Re: assertion failure 9.3.4

On 04/16/2014 07:19 PM, Tom Lane wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I'm not quite clear on why the third query, the one in ri_PerformCheck,
is invoking a sequence.

It's not --- SeqNext is the next-tuple function for a sequential scan.
Nothing to do with sequences.

Now, it *is* worth wondering why the heck a query on the table's primary
key is using a seqscan and not an indexscan. Maybe the planner thinks
there are just a few rows in the table? But the stack trace seems
unexceptional other than that.

I'm wondering if the combination of autoexplain and pg_stat_statements
is causing trouble.

Yeah, it would be real nice to see a self-contained test case for this.

Well, that might be hard to put together, but I did try running without
pg_stat_statements and auto_explain loaded and the error did not occur.
Not sure where that gets us in terms of deciding on a culprit.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#7)
Re: assertion failure 9.3.4

Andrew Dunstan <andrew@dunslane.net> writes:

On 04/16/2014 07:19 PM, Tom Lane wrote:

Yeah, it would be real nice to see a self-contained test case for this.

Well, that might be hard to put together, but I did try running without
pg_stat_statements and auto_explain loaded and the error did not occur.
Not sure where that gets us in terms of deciding on a culprit.

Could we at least see the exact parameter settings for pg_stat_statements
and auto_explain? (And any other GUCs with nondefault values?)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Dunstan (#7)
Re: assertion failure 9.3.4

Andrew Dunstan wrote:

On 04/16/2014 07:19 PM, Tom Lane wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I'm not quite clear on why the third query, the one in ri_PerformCheck,
is invoking a sequence.

It's not --- SeqNext is the next-tuple function for a sequential scan.
Nothing to do with sequences.

Now, it *is* worth wondering why the heck a query on the table's primary
key is using a seqscan and not an indexscan. Maybe the planner thinks
there are just a few rows in the table? But the stack trace seems
unexceptional other than that.

I'm wondering if the combination of autoexplain and pg_stat_statements
is causing trouble.

Yeah, it would be real nice to see a self-contained test case for this.

Well, that might be hard to put together, but I did try running
without pg_stat_statements and auto_explain loaded and the error did
not occur.

Well, can you get us the queries that are being run, and the schemas
involved in those queries?

Not sure where that gets us in terms of deciding on a
culprit.

I suspect, in the blind, that autoexplain is executing a query that
causes a second update XID to be added to a multixact (which already has
another one), perhaps with the same snapshot as the original update.
Why would this be happening in a FK-check query? No idea.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#8)
Re: assertion failure 9.3.4

On 04/16/2014 10:28 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

On 04/16/2014 07:19 PM, Tom Lane wrote:

Yeah, it would be real nice to see a self-contained test case for this.

Well, that might be hard to put together, but I did try running without
pg_stat_statements and auto_explain loaded and the error did not occur.
Not sure where that gets us in terms of deciding on a culprit.

Could we at least see the exact parameter settings for pg_stat_statements
and auto_explain? (And any other GUCs with nondefault values?)

Here are all the settings from the run that failed:

listen_addresses = '*'
port = 5432
fsync = on
synchronous_commit = off
checkpoint_segments = 128
checkpoint_completion_target = 0.9
shared_buffers = 512MB
max_connections = 300
work_mem = 128MB
maintenance_work_mem = 32MB
effective_cache_size = 16GB
effective_io_concurrency = 2
logging_collector = on
log_destination = 'stderr'
log_filename = 'postgresql-%a.log'
log_rotation_size = 0
log_truncate_on_rotation = on
log_line_prefix = '%t [%p] %l: '
log_connections = on
log_disconnections = on
log_statement = 'all'
track_activity_query_size = 10240
shared_preload_libraries = 'auto_explain,pg_stat_statements'

As you can see, auto_explain's log_min_duration hasn't been set, so it
shouldn't be doing anything very much, I should think.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Josh Berkus
josh@agliodbs.com
In reply to: Andrew Dunstan (#1)
Re: assertion failure 9.3.4

All,

So have encountered a 2nd report of this issue, or of an issue which
sounds very similar:

- corruption in two "queue" tables

- the tables are written in a high-concurrency, lock-contested environment

- user uses SELECT FOR UPDATE with these tables.

- pg_stat_statements .so is loaded, but the extension is not installed

- four rows were added to the queue tables, while not being added to the
PK index. This allowed duplicate PKs to be added.

Currently the user is testing not loading the pg_stat_statements.so to
see if the problem goes away. They have a destruction test environment,
so we should be able to confirm/deny in a couple days.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Andrew Dunstan (#10)
Re: assertion failure 9.3.4

On Thu, Apr 17, 2014 at 7:15 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

track_activity_query_size = 10240
shared_preload_libraries = 'auto_explain,pg_stat_statements'

As you can see, auto_explain's log_min_duration hasn't been set, so it
shouldn't be doing anything very much, I should think.

track_activity_query_size is 10240? Not 1024?

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Peter Geoghegan (#12)
Re: assertion failure 9.3.4

On 04/17/2014 09:04 PM, Peter Geoghegan wrote:

On Thu, Apr 17, 2014 at 7:15 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

track_activity_query_size = 10240
shared_preload_libraries = 'auto_explain,pg_stat_statements'

As you can see, auto_explain's log_min_duration hasn't been set, so it
shouldn't be doing anything very much, I should think.

track_activity_query_size is 10240? Not 1024?

yes.

I normally find 1024 is far too small.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#10)
Re: assertion failure 9.3.4

On 04/17/2014 10:15 AM, Andrew Dunstan wrote:

On 04/16/2014 10:28 PM, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

On 04/16/2014 07:19 PM, Tom Lane wrote:

Yeah, it would be real nice to see a self-contained test case for
this.

Well, that might be hard to put together, but I did try running without
pg_stat_statements and auto_explain loaded and the error did not occur.
Not sure where that gets us in terms of deciding on a culprit.

Could we at least see the exact parameter settings for
pg_stat_statements
and auto_explain? (And any other GUCs with nondefault values?)

Here are all the settings from the run that failed:

listen_addresses = '*'
port = 5432
fsync = on
synchronous_commit = off
checkpoint_segments = 128
checkpoint_completion_target = 0.9
shared_buffers = 512MB
max_connections = 300
work_mem = 128MB
maintenance_work_mem = 32MB
effective_cache_size = 16GB
effective_io_concurrency = 2
logging_collector = on
log_destination = 'stderr'
log_filename = 'postgresql-%a.log'
log_rotation_size = 0
log_truncate_on_rotation = on
log_line_prefix = '%t [%p] %l: '
log_connections = on
log_disconnections = on
log_statement = 'all'
track_activity_query_size = 10240
shared_preload_libraries = 'auto_explain,pg_stat_statements'

As you can see, auto_explain's log_min_duration hasn't been set, so it
shouldn't be doing anything very much, I should think.

There definitely seems to be something going on involving these two
pre-loaded modules. With both auto_explain and pg_stat_statements
preloaded I can reproduce the error fairly reliably. I have also
reproduced it, but less reliably, with auto_explain alone loaded. I have
not reproduced it with pg_stat_statements alone loaded, but Josh Berkus
has reported that another client has experienced something very similar
(duplicate PK errors on a non-assertion-enabled build) with
pg_stat_statements alone loaded.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Josh Berkus
josh@agliodbs.com
In reply to: Andrew Dunstan (#1)
Re: assertion failure 9.3.4

On 04/18/2014 09:42 AM, Andrew Dunstan wrote:

There definitely seems to be something going on involving these two
pre-loaded modules. With both auto_explain and pg_stat_statements
preloaded I can reproduce the error fairly reliably. I have also
reproduced it, but less reliably, with auto_explain alone loaded. I have
not reproduced it with pg_stat_statements alone loaded, but Josh Berkus
has reported that another client has experienced something very similar
(duplicate PK errors on a non-assertion-enabled build) with
pg_stat_statements alone loaded.

Correct. However, this seems to produce the issue less "reliably";
that is, it takes several hours of load testing for a duplicate PK to
show up, so I suspect that with pg_stat_statements alone it's also
timing issue. I'm still waiting on the results with
pg_stat_statements.so NOT loaded to confirm that we're not actually
getting two separate bugs with similar symptoms.

The second site does not have any increase in query size. Their only
settings for pg_s_s are:

pg_stat_statements.max = 10000
pg_stat_statements.track = all

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Josh Berkus
josh@agliodbs.com
In reply to: Andrew Dunstan (#1)
Re: assertion failure 9.3.4

All,

More on this:

1) I've confirmed at the 2nd site that the issue doesn't happen if
pg_stat_statements.so is not loaded. So this seems to be confirmation
that either auto_explain, pg_stat_statements, or both need to be loaded
(but not necessarily created as extensions) in order to have the issue.

2) I've been unable to reproduce the issue using a naive test case, but
I'll keep trying.

3) The XID information is interesting.

a) I've confirmed that this is a case of having multiple row versions
rather than allowing a duplicate PK to be inserted.

b) the end result of this corruption is XIDs which go backwards:

xmin | xmax | ctid | step_id | workitem_id | status
---------+---------+-----------+---------+-------------+---------
3362707 | 3362707 | (6177,31) | 1 | 446469 | pending
3362710 | 3362707 | (6177,32) | 1 | 446469 | working

5520953 | 5520953 | (5064,105) | 1 | 727946 | pending
5520954 | 5520953 | (5064,108) | 1 | 727946 | working

What's additionally problematic is that the current snapshot minxid is
in the 9000000 range, so it's not clear why any of the above rows are
visible at all.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Stephen Frost
sfrost@snowman.net
In reply to: Josh Berkus (#16)
Re: assertion failure 9.3.4

* Josh Berkus (josh@agliodbs.com) wrote:

1) I've confirmed at the 2nd site that the issue doesn't happen if
pg_stat_statements.so is not loaded. So this seems to be confirmation
that either auto_explain, pg_stat_statements, or both need to be loaded
(but not necessarily created as extensions) in order to have the issue.

Interesting...

2) I've been unable to reproduce the issue using a naive test case, but
I'll keep trying.

Thanks.

3) The XID information is interesting.

a) I've confirmed that this is a case of having multiple row versions
rather than allowing a duplicate PK to be inserted.

b) the end result of this corruption is XIDs which go backwards:

xmin | xmax | ctid | step_id | workitem_id | status
---------+---------+-----------+---------+-------------+---------
3362707 | 3362707 | (6177,31) | 1 | 446469 | pending
3362710 | 3362707 | (6177,32) | 1 | 446469 | working

5520953 | 5520953 | (5064,105) | 1 | 727946 | pending
5520954 | 5520953 | (5064,108) | 1 | 727946 | working

What's additionally problematic is that the current snapshot minxid is
in the 9000000 range, so it's not clear why any of the above rows are
visible at all.

Can you get the infomask bits..? What's does pg_controldata report wrt
the MultiXid's?

THanks,

Stephen

#18Josh Berkus
josh@agliodbs.com
In reply to: Alvaro Herrera (#5)
Re: assertion failure 9.3.4

Can you get the infomask bits..? What's does pg_controldata report wrt
the MultiXid's?

Can't get the infomask bits.

pg_controldata attached, with some redactions. Unfortunately, it
appears that they've continued to do tests on this system, so the XID
counter has advanced somewhat.

pg_control version number: 937
Catalog version number: {redacted}
Database system identifier: {redacted}
Database cluster state: in production
pg_control last modified: Mon 21 Apr 2014 04:28:30 PM UTC
Latest checkpoint location: 8/C8B05F98
Prior checkpoint location: 8/C663A4B0
Latest checkpoint's REDO location: 8/C6DC9A10
Latest checkpoint's REDO WAL file: 0000000300000008000000C6
Latest checkpoint's TimeLineID: 3
Latest checkpoint's PrevTimeLineID: 3
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/11214375
Latest checkpoint's NextOID: 32769
Latest checkpoint's NextMultiXactId: 28939
Latest checkpoint's NextMultiOffset: 58004
Latest checkpoint's oldestXID: 674
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 11214375
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Time of latest checkpoint: Mon 21 Apr 2014 04:21:02 PM UTC
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current max_connections setting: 100
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#16)
Re: assertion failure 9.3.4

Josh Berkus <josh@agliodbs.com> writes:

1) I've confirmed at the 2nd site that the issue doesn't happen if
pg_stat_statements.so is not loaded. So this seems to be confirmation
that either auto_explain, pg_stat_statements, or both need to be loaded
(but not necessarily created as extensions) in order to have the issue.

I looked a little bit for possible interactions between those two modules,
and almost immediately noticed that pg_stat_statements is not being polite
enough to call any previous occupant of the post_parse_analyze_hook.
That's not the bug we're looking for, since auto_explain does not use
post_parse_analyze_hook, but it's a bug nonetheless.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#1)
Re: assertion failure 9.3.4

Hi,

I spent the last two hours poking arounds in the environment Andrew
provided and I was able to reproduce the issue, find a assert to
reproduce it much faster and find a possible root cause.

Since the symptom of the problem seem to be multixacts with more than
one updating xid, I added a check to MultiXactIdCreateFromMembers()
preventing that. That requires to move ISUPDATE_from_mxstatus() to a
header, but I think we should definitely add such a assert.

As it turns out the problem is in the
else if (result == HeapTupleBeingUpdated && wait)
branch in (at least) heap_update(). When the problem is hit the
to-be-updated tuple originally has HEAP_XMIN_COMMITTED |
HEAP_XMAX_LOCK_ONLY | HEAP_XMAX_KEYSHR_LOCK set. So we release the
buffer lock, acquire the tuple lock, and reacquire the buffer lock. But
inbetween the locking backend has actually updated the tuple.
The code tries to protect against that with:
/*
* recheck the locker; if someone else changed the tuple while
* we weren't looking, start over.
*/
if ((oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
!TransactionIdEquals(
HeapTupleHeaderGetRawXmax(oldtup.t_data),
xwait))
goto l2;

can_continue = true;
locker_remains = true;

and similar. The problem is that in Andrew's case the infomask changes
from 0x2192 to 0x2102 (i.e. it's a normal update afterwards), while xmax
stays the same. Ooops.
A bit later there's:
result = can_continue ? HeapTupleMayBeUpdated : HeapTupleUpdated;
So, from thereon we happily continue to update the tuple, thinking
there's no previous updater. Which obviously causes problems.

I've hacked^Wfixed this by changing the infomask test above into
infomask != oldtup.t_data->t_infomask in a couple of places. That seems
to be sufficient to survive the testcase a couple of times.

I am too hungry right now to think about a proper fix for this and
whether there's further problematic areas.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#20)
#22Josh Berkus
josh@agliodbs.com
In reply to: Andrew Dunstan (#1)
#23Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#21)
#24Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#21)
#25Andres Freund
andres@anarazel.de
In reply to: Josh Berkus (#22)
#26Josh Berkus
josh@agliodbs.com
In reply to: Andrew Dunstan (#1)
#27Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#20)
#28Josh Berkus
josh@agliodbs.com
In reply to: Andrew Dunstan (#1)
#29Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#27)
#30Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#29)
#31Josh Berkus
josh@agliodbs.com
In reply to: Andrew Dunstan (#1)
#32Andrew Dunstan
andrew@dunslane.net
In reply to: Josh Berkus (#31)
#33Josh Berkus
josh@agliodbs.com
In reply to: Andrew Dunstan (#1)
#34Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#32)
#35Andres Freund
andres@anarazel.de
In reply to: Josh Berkus (#33)
#36Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#30)
#37Josh Berkus
josh@agliodbs.com
In reply to: Andrew Dunstan (#1)
#38Andres Freund
andres@anarazel.de
In reply to: Josh Berkus (#37)
#39Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Josh Berkus (#37)
#40Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#36)
#41Josh Berkus
josh@agliodbs.com
In reply to: Andrew Dunstan (#1)
#42Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#40)
#43Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#42)