Failed assert ((data - start) == data_size) in heaptuple.c

Started by Brendan Jurdalmost 15 years ago9 messages
#1Brendan Jurd
direvus@gmail.com

Hi folks,

I am running a 9.0.3 Hot Standy + Streaming Replication slave which
occasionally segfaults (every 1-2 days). I rebuilt Postgres with
--enable-cassert and --enable-debug, switched on core dumping and
waited for some results.

The first crash since enabling debugging was a failed assert in heaptuple.c:

TRAP: FailedAssertion("!((data - start) == data_size)", File:
"heaptuple.c", Line: 255)
2011-04-07 04:20:20 EST LOG: server process (PID 32195) was
terminated by signal 6: Aborted
2011-04-07 04:20:20 EST LOG: terminating any other active server processes

For context, the only things running on this server are the slave
database, and a tomcat instance. The tomcat instance is the only
connection into this database, which continually runs through a series
of SELECTs (100ms sleep between each run).

The slave database is basically stock standard 9.0.3 config, apart
from the replication setup and shared_buffers increased to 2GB.

Here's the backtrace:

Core was generated by `postgres: backend surecast 127.0.0.1(37155)
SELECT '.
Program terminated with signal 6, Aborted.
#0 0x00007f40a93aba75 in *__GI_raise (sig=<value optimised out>)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
in ../nptl/sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0 0x00007f40a93aba75 in *__GI_raise (sig=<value optimised out>)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007f40a93af5c0 in *__GI_abort () at abort.c:92
#2 0x00000000006f861d in ExceptionalCondition (conditionName=<value
optimised out>,
errorType=<value optimised out>, fileName=<value optimised out>,
lineNumber=<value optimised out>) at assert.c:57
#3 0x0000000000459b07 in heap_form_minimal_tuple
(tupleDescriptor=0xf6bdd0, values=0x8,
isnull=0xf6c680 "") at heaptuple.c:1459
#4 0x0000000000580d12 in ExecFetchSlotMinimalTuple (slot=0xf6bb90) at
execTuples.c:684
#5 0x0000000000588d10 in ExecHashTableInsert (hashtable=0xf4c3b0, slot=0x7dc3,
hashvalue=6) at nodeHash.c:697
#6 0x0000000000589bf6 in MultiExecHash (node=<value optimised out>)
at nodeHash.c:123
#7 0x000000000058a9ab in ExecHashJoin (node=0xf24008) at nodeHashjoin.c:154
#8 0x00000000005788a8 in ExecProcNode (node=0xf24008) at execProcnode.c:427
#9 0x000000000058fb21 in ExecNestLoop (node=0xf8eb98) at nodeNestloop.c:120
#10 0x00000000005788c8 in ExecProcNode (node=0xf8eb98) at execProcnode.c:419
#11 0x000000000057756d in ExecutePlan (queryDesc=0xf8bc30,
direction=32195, count=0)
at execMain.c:1187
#12 standard_ExecutorRun (queryDesc=0xf8bc30, direction=32195,
count=0) at execMain.c:280
#13 0x0000000000642e28 in PortalRunSelect (portal=0xf11f88,
forward=<value optimised out>,
count=0, dest=0xe00120) at pquery.c:952
#14 0x00000000006442e9 in PortalRun (portal=<value optimised out>,
count=<value optimised out>, isTopLevel=<value optimised out>,
dest=<value optimised out>, altdest=<value optimised out>,
completionTag=<value optimised out>) at pquery.c:796
#15 0x00000000006419e3 in exec_execute_message (argc=<value optimised out>,
argv=<value optimised out>, username=<value optimised out>) at
postgres.c:2003
#16 PostgresMain (argc=<value optimised out>, argv=<value optimised out>,
username=<value optimised out>) at postgres.c:3988
#17 0x0000000000606a07 in BackendRun () at postmaster.c:3555
#18 BackendStartup () at postmaster.c:3242
#19 ServerLoop () at postmaster.c:1431
#20 0x000000000060931d in PostmasterMain (argc=14918336, argv=0xdfb160)
at postmaster.c:1092
#21 0x00000000005a9310 in main (argc=5, argv=0xdfb140) at main.c:188

Let me know if there is any additional information I can provide.

Cheers,
BJ

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brendan Jurd (#1)
Re: Failed assert ((data - start) == data_size) in heaptuple.c

Brendan Jurd <direvus@gmail.com> writes:

TRAP: FailedAssertion("!((data - start) == data_size)", File:
"heaptuple.c", Line: 255)

[ scratches head ... ] That implies that heap_fill_tuple came to a
different conclusion about a tuple's data size than the immediately
preceding heap_compute_data_size. Which I would sure want to believe
is impossible. Have you checked for flaky memory on this machine?

regards, tom lane

#3Brendan Jurd
direvus@gmail.com
In reply to: Tom Lane (#2)
Re: Failed assert ((data - start) == data_size) in heaptuple.c

On 7 April 2011 16:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Brendan Jurd <direvus@gmail.com> writes:

TRAP: FailedAssertion("!((data - start) == data_size)", File:
"heaptuple.c", Line: 255)

[ scratches head ... ]  That implies that heap_fill_tuple came to a
different conclusion about a tuple's data size than the immediately
preceding heap_compute_data_size.  Which I would sure want to believe
is impossible.  Have you checked for flaky memory on this machine?

We are doing so now -- although the RAM is ECC and just a few months
old, so flakiness seems a distant possibility. I will report back
after we've given it a proper thrashing with memtest.

Cheers,
BJ

#4Craig Ringer
craig@postnewspapers.com.au
In reply to: Brendan Jurd (#3)
Re: Failed assert ((data - start) == data_size) in heaptuple.c

On 04/07/2011 03:07 PM, Brendan Jurd wrote:

On 7 April 2011 16:56, Tom Lane<tgl@sss.pgh.pa.us> wrote:

Brendan Jurd<direvus@gmail.com> writes:

TRAP: FailedAssertion("!((data - start) == data_size)", File:
"heaptuple.c", Line: 255)

[ scratches head ... ] That implies that heap_fill_tuple came to a
different conclusion about a tuple's data size than the immediately
preceding heap_compute_data_size. Which I would sure want to believe
is impossible. Have you checked for flaky memory on this machine?

We are doing so now -- although the RAM is ECC and just a few months
old, so flakiness seems a distant possibility. I will report back
after we've given it a proper thrashing with memtest.

Apparently bad RAM can also mean faulty CPU (bad cache, heat problems,
etc). memtest86 seems ... rough ... at best when it comes to finding
issues; I've had some systems run it for a day yet continuously segfault
in real-world use until the RAM was re-seated or swapped out.

--
Craig Ringer

#5Alvaro Herrera
alvherre@commandprompt.com
In reply to: Brendan Jurd (#1)
Re: Failed assert ((data - start) == data_size) in heaptuple.c

Excerpts from Brendan Jurd's message of jue abr 07 03:07:32 -0300 2011:

Hi folks,

I am running a 9.0.3 Hot Standy + Streaming Replication slave which
occasionally segfaults (every 1-2 days). I rebuilt Postgres with
--enable-cassert and --enable-debug, switched on core dumping and
waited for some results.

What's the platform, and what's the query? Are there funny datatypes
involved?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#6Brendan Jurd
direvus@gmail.com
In reply to: Alvaro Herrera (#5)
Re: Failed assert ((data - start) == data_size) in heaptuple.c

On 8 April 2011 00:16, Alvaro Herrera <alvherre@commandprompt.com> wrote:

Excerpts from Brendan Jurd's message of jue abr 07 03:07:32 -0300 2011:

I am running a 9.0.3 Hot Standy + Streaming Replication slave which
occasionally segfaults (every 1-2 days).  I rebuilt Postgres with
--enable-cassert and --enable-debug, switched on core dumping and
waited for some results.

What's the platform, and what's the query?  Are there funny datatypes
involved?

Ubuntu 10.04 x64 on:

HP DL380R05
1x Quad Core Xeon E5440
10GB PC 5400 DDR ECC
2x HP 146GB 15krpm SAS drives in RAID 1+0

The tomcat instance repeatedly runs a series of some 9 queries, I'm
not sure which of the queries is the culprit or even whether it is the
same one each time. However, they are all straightforward SELECTs.
The one with the most complicated plan joins a whole six tables. I do
keep the transaction open until I have executed all the SELECTs in the
series, then commit and start over again with a fresh transaction.
That's just to make sure all of the queries are pulling data from the
same snapshot.

As for datatypes, I do have one type that I have defined which is used
in one of the queries. It's just an RGB colour value, defined as a
composite type:

CREATE DOMAIN colour_channel AS smallint
CHECK (VALUE >= 0 AND VALUE < 256);

CREATE TYPE rgb AS (
red colour_channel,
green colour_channel,
blue colour_channel
);

All of the user-defined functions I have written for this db are
either SQL or PL/pgSQL, and all of the functions called by these
queries are either STABLE or IMMUTABLE.

Cheers,
BJ

#7Brendan Jurd
direvus@gmail.com
In reply to: Tom Lane (#2)
Re: Failed assert ((data - start) == data_size) in heaptuple.c

On 7 April 2011 16:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Brendan Jurd <direvus@gmail.com> writes:

TRAP: FailedAssertion("!((data - start) == data_size)", File:
"heaptuple.c", Line: 255)

[ scratches head ... ]  That implies that heap_fill_tuple came to a
different conclusion about a tuple's data size than the immediately
preceding heap_compute_data_size.  Which I would sure want to believe
is impossible.  Have you checked for flaky memory on this machine?

Memtest didn't report any errors. I intend to try swapping out the
RAM tomorrow, but in the meantime we got a *different* assertion
failure today. The fact that we are tripping over various different
assertions seems to lend further weight to the flaky hardware
hypothesis.

TRAP: FailedAssertion("!(((lpp)->lp_flags == 1))", File: "heapam.c", Line: 727)

#0 0x00007f2773f23a75 in *__GI_raise (sig=<value optimised out>)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007f2773f275c0 in *__GI_abort () at abort.c:92
#2 0x00000000006f9eed in ExceptionalCondition (conditionName=<value
optimised out>,
errorType=<value optimised out>, fileName=<value optimised out>,
lineNumber=<value optimised out>) at assert.c:57
#3 0x0000000000473641 in heapgettup_pagemode (scan=0x2366da8,
dir=<value optimised out>,
nkeys=<value optimised out>, key=<value optimised out>) at heapam.c:727
#4 0x0000000000474b16 in heap_getnext (scan=0x2366da8,
direction=1495) at heapam.c:1322
#5 0x0000000000590fcb in SeqNext (node=<value optimised out>) at
nodeSeqscan.c:66
#6 0x00000000005808ff in ExecScanFetch (node=0x22d5ff8,
accessMtd=<value optimised out>,
recheckMtd=<value optimised out>) at execScan.c:82
#7 ExecScan (node=0x22d5ff8, accessMtd=<value optimised out>,
recheckMtd=<value optimised out>) at execScan.c:164
#8 0x0000000000578d58 in ExecProcNode (node=0x22d5ff8) at execProcnode.c:378
#9 0x000000000058abf7 in ExecHashJoinOuterGetTuple (node=0x22d4a60)
at nodeHashjoin.c:562
#10 ExecHashJoin (node=0x22d4a60) at nodeHashjoin.c:187
#11 0x0000000000578ca8 in ExecProcNode (node=0x22d4a60) at execProcnode.c:427
#12 0x000000000058abf7 in ExecHashJoinOuterGetTuple (node=0x22d3430)
at nodeHashjoin.c:562
#13 ExecHashJoin (node=0x22d3430) at nodeHashjoin.c:187
#14 0x0000000000578ca8 in ExecProcNode (node=0x22d3430) at execProcnode.c:427
#15 0x0000000000590021 in ExecNestLoop (node=0x22d26d8) at nodeNestloop.c:120
#16 0x0000000000578cc8 in ExecProcNode (node=0x22d26d8) at execProcnode.c:419
#17 0x0000000000590021 in ExecNestLoop (node=0x22c0c88) at nodeNestloop.c:120
#18 0x0000000000578cc8 in ExecProcNode (node=0x22c0c88) at execProcnode.c:419
#19 0x0000000000591bf9 in ExecSort (node=0x22c0a50) at nodeSort.c:102
#20 0x0000000000578c88 in ExecProcNode (node=0x22c0a50) at execProcnode.c:438
#21 0x000000000057795e in ExecutePlan (queryDesc=0x23151f0,
direction=1495, count=0)
at execMain.c:1187
#22 standard_ExecutorRun (queryDesc=0x23151f0, direction=1495,
count=0) at execMain.c:280
#23 0x0000000000643d67 in PortalRunSelect (portal=0x229bf78,
forward=<value optimised out>, count=0, dest=0x218a120) at pquery.c:952
#24 0x0000000000645210 in PortalRun (portal=<value optimised out>,
count=<value optimised out>, isTopLevel=<value optimised out>,
dest=<value optimised out>, altdest=<value optimised out>,
completionTag=<value optimised out>) at pquery.c:796
#25 0x00000000006428dc in exec_execute_message (argc=<value optimised out>,
argv=<value optimised out>, username=<value optimised out>) at
postgres.c:2003
#26 PostgresMain (argc=<value optimised out>, argv=<value optimised out>,
username=<value optimised out>) at postgres.c:3988
#27 0x0000000000607351 in BackendRun () at postmaster.c:3555
#28 BackendStartup () at postmaster.c:3242
#29 ServerLoop () at postmaster.c:1431
#30 0x0000000000609c6d in PostmasterMain (argc=35406528, argv=0x2185160)
at postmaster.c:1092
#31 0x00000000005a99a0 in main (argc=5, argv=0x2185140) at main.c:188

#8Alvaro Herrera
alvherre@commandprompt.com
In reply to: Brendan Jurd (#7)
Re: Failed assert ((data - start) == data_size) in heaptuple.c

Excerpts from Brendan Jurd's message of vie abr 08 06:00:22 -0300 2011:

Memtest didn't report any errors. I intend to try swapping out the
RAM tomorrow, but in the meantime we got a *different* assertion
failure today. The fact that we are tripping over various different
assertions seems to lend further weight to the flaky hardware
hypothesis.

TRAP: FailedAssertion("!(((lpp)->lp_flags == 1))", File: "heapam.c", Line: 727)

Yep.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#9Brendan Jurd
direvus@gmail.com
In reply to: Alvaro Herrera (#8)
Re: Failed assert ((data - start) == data_size) in heaptuple.c

On 9 April 2011 00:41, Alvaro Herrera <alvherre@commandprompt.com> wrote:

Excerpts from Brendan Jurd's message of vie abr 08 06:00:22 -0300 2011:

Memtest didn't report any errors.  I intend to try swapping out the
RAM tomorrow, but in the meantime we got a *different* assertion
failure today.  The fact that we are tripping over various different
assertions seems to lend further weight to the flaky hardware
hypothesis.

TRAP: FailedAssertion("!(((lpp)->lp_flags == 1))", File: "heapam.c", Line: 727)

Yep.

I swapped the RAM with another machine, and after a few hours running
the other machine popped a segfault. The faulty RAM diagnosis is now
official, so I won't be bothering you folks about this any further.

Cheers,
BJ