postmaster segfault when using SELECT on a table

Started by Karsten Desler, almost 18 years ago, 7 messages, bugs
#1 Karsten Desler
kd@link11.de

Hello,

I have a smallish postgres database that segfaults every time I try to
access a certain row in a certain column.

xxx=# select file_id from dbfiles offset 632531 limit 1;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

file_id is a varchar(40).

The database is a self-compiled 8.2.6 on an x86_64 Linux 2.6 machine
that was running well for about 8 months before it started exhibiting the
problem a couple of days ago.
I tried upgrading to 8.2.7 but the problem is still occurring.

I have recompiled the postgres server without -O2 and with -g and have
captured a coredump. Here's a bt and the first section of a bt full.
If you need more information, please don't hesitate to contact me.

Core was generated by `postgres: postgres xxx [local] SELECT '.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000067c01b in pglz_decompress (source=0x2b3ab8060910, dest=0xa57744 "d") at pg_lzcompress.c:678
678 *bp = bp[-off];
(gdb) bt
#0 0x000000000067c01b in pglz_decompress (source=0x2b3ab8060910, dest=0xa57744 "d") at pg_lzcompress.c:678
#1 0x00000000004613b6 in heap_tuple_untoast_attr (attr=0x2b3ab8060910) at tuptoaster.c:128
#2 0x00000000006a3a19 in pg_detoast_datum (datum=0x2b3ab8060910) at fmgr.c:1973
#3 0x00000000004423d0 in printtup (slot=0xa3ff78, self=0xa3de00) at printtup.c:317
#4 0x000000000053d184 in ExecSelect (slot=0xa3ff78, dest=0xa3de00, estate=0xa3fe00) at execMain.c:1310
#5 0x000000000053cfea in ExecutePlan (estate=0xa3fe00, planstate=0xa40120, operation=CMD_SELECT, numberTuples=0, direction=ForwardScanDirection, dest=0xa3de00) at execMain.c:1236
#6 0x000000000053b9aa in ExecutorRun (queryDesc=0xa335d0, direction=ForwardScanDirection, count=0) at execMain.c:241
#7 0x00000000005f9f4c in PortalRunSelect (portal=0xa4b6c0, forward=1 '\001', count=0, dest=0xa3de00) at pquery.c:831
#8 0x00000000005f9bc3 in PortalRun (portal=0xa4b6c0, count=9223372036854775807, dest=0xa3de00, altdest=0xa3de00, completionTag=0x7fff01602ac0 "") at pquery.c:656
#9 0x00000000005f4737 in exec_simple_query (query_string=0xa05100 "select file_id from dbfiles offset 632531 limit 1;") at postgres.c:939
#10 0x00000000005f8376 in PostgresMain (argc=4, argv=0x9672c8, username=0x967290 "postgres") at postgres.c:3424
#11 0x00000000005c4318 in BackendRun (port=0x9634d0) at postmaster.c:2934
#12 0x00000000005c38c5 in BackendStartup (port=0x9634d0) at postmaster.c:2561
#13 0x00000000005c154a in ServerLoop () at postmaster.c:1214
#14 0x00000000005c0f70 in PostmasterMain (argc=3, argv=0x946230) at postmaster.c:966
#15 0x0000000000568d76 in main (argc=3, argv=0x946230) at main.c:188
(gdb) bt full
#0 0x000000000067c01b in pglz_decompress (source=0x2b3ab8060910, dest=0xa57744 "d") at pg_lzcompress.c:678
dp = (const unsigned char *) 0x2b3ab80707ea "`"
dend = (const unsigned char *) 0x2b3abff0c03d "6.21.163"
bp = (unsigned char *) 0xa7b000 <Address 0xa7b000 out of bounds>
ctrl = 3 '\003'
ctrlc = 5
len = 5
off = 1633
destsize = 0
__func__ = "pglz_decompress"
--snip

Thanks in advance,
Karsten Desler
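
The faulting statement in frame #0, `*bp = bp[-off]`, copies a byte from `off` positions back in the output that has already been decompressed. In the `bt full` output, `bp` has reached 0xa7b000, which gdb reports as out of bounds, and `dend` lies more than 100 MB past `dp`, which would be consistent with a corrupted length in the stored value. The following is a minimal sketch of the kind of bounds check that turns such a copy into a reportable error. It is not the PostgreSQL code; the function name `checked_backref_copy` and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Copy `len` bytes that start `off` bytes behind the write position, the
 * same operation as `*bp = bp[-off]` in the backtrace.  Hypothetical
 * helper: it returns false instead of touching memory outside
 * [dest_start, dest_end).
 */
static bool
checked_backref_copy(const unsigned char *dest_start, unsigned char **bpp,
                     const unsigned char *dest_end, size_t off, size_t len)
{
    unsigned char *bp = *bpp;

    /* The match must start inside the output produced so far. */
    if (off == 0 || off > (size_t) (bp - dest_start))
        return false;

    /* The copy must not run past the end of the output buffer. */
    if (len > (size_t) (dest_end - bp))
        return false;

    /* Byte by byte on purpose: source and target may overlap (off < len). */
    while (len-- > 0)
    {
        *bp = bp[-(ptrdiff_t) off];
        bp++;
    }

    *bpp = bp;
    return true;
}
```

A caller would report an error when the function returns false rather than continuing to decompress. In the crash above it is the second check, the one against the end of the output buffer, that would have failed.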

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Karsten Desler (#1)
Re: postmaster segfault when using SELECT on a table

Karsten Desler <kd@link11.de> writes:

I have a smallish postgres database that segfaults every time I try to
access a certain row in a certain column.

Looks like a corrupted-data issue to me. It might be interesting to
dump the page with pg_filedump and see if there's any apparent pattern
to the damage.

regards, tom lane

#3 Karsten Desler
kd@link11.de
In reply to: Tom Lane (#2)
Re: postmaster segfault when using SELECT on a table

* Tom Lane wrote:

Karsten Desler <kd@link11.de> writes:

I have a smallish postgres database that segfaults every time I try to
access a certain row in a certain column.

Looks like a corrupted-data issue to me. It might be interesting to
dump the page with pg_filedump and see if there's any apparent pattern
to the damage.

Thanks, I'll try to play with pg_filedump later tonight.
I've never had problems with on-disk data corruption on this or many other
postgres servers, and I'm perfectly fine with chalking it up to hardware
problems.

I don't know much about the postgres architecture and I don't know if bounds
checking on-disk values on a read makes a lot of sense since usually one
should be able to assume that there are no randomly flipped bits; but it
would've been nice to have a sensible log entry as to what really
happened.

Anyway, for future reference: Assuming that this is the only corruption,
can I just UPDATE (or DELETE and reINSERT) the offending entry (maybe with a
following REINDEX/VACUUM?) or do I need to restore a backup?
If possible, I'd prefer the UPDATE solution, of course, since it can be done
without any downtime.

Keep up the good work.

Best regards,
Karsten Desler

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Karsten Desler (#3)
Re: postmaster segfault when using SELECT on a table

Karsten Desler <kd@link11.de> writes:

I don't know much about the postgres architecture and I don't know if bounds
checking on-disk values on a read makes a lot of sense since usually one
should be able to assume that there are no randomly flipped bits; but it
would've been nice to have a sensible log entry as to what really
happened.

FWIW, there is code in CVS HEAD that detects simple cases of corrupt
compressed data, though it's anyone's guess if it would've caught your
example here.

Anyway, for future reference: Assuming that this is the only corruption,
can I just UPDATE (or DELETE and reINSERT) the offending entry (maybe with a
following REINDEX/VACUUM?) or do I need to restore a backup?

If only the one row is clobbered, you should be able to just delete and
re-insert it, assuming you can identify it in a way that doesn't crash
in itself (ctid is probably about the safest). Not sure if an UPDATE
would be safe.

My suspicion though is that you'll find that a large portion of that
page is damaged; that's usually what we've seen in such cases in the
past.

regards, tom lane
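
For anyone following the ctid and pg_filedump suggestions in this thread: a ctid is a (block number, item index) pair, so the block number also says which page of the table's file to inspect. The sketch below shows the arithmetic, assuming the default build settings of 8 kB pages and 1 GB segment files. The names `locate_block`, `BlockLocation`, and both `ASSUMED_` constants are hypothetical, and a custom build may use different sizes.

```c
#include <stdint.h>

/* Assumed build defaults; a custom build may differ. */
#define ASSUMED_BLCKSZ       8192u    /* bytes per page */
#define ASSUMED_RELSEG_SIZE  131072u  /* pages per 1 GB segment file */

typedef struct
{
    uint32_t segment;       /* 0 is "<relfilenode>", n is "<relfilenode>.n" */
    uint32_t block_in_seg;  /* page index inside that segment file */
    uint64_t byte_offset;   /* where the page starts inside the file */
} BlockLocation;

/* Hypothetical helper: map a ctid block number to a place on disk. */
static BlockLocation
locate_block(uint32_t blkno)
{
    BlockLocation loc;

    loc.segment = blkno / ASSUMED_RELSEG_SIZE;
    loc.block_in_seg = blkno % ASSUMED_RELSEG_SIZE;
    loc.byte_offset = (uint64_t) loc.block_in_seg * ASSUMED_BLCKSZ;
    return loc;
}
```

For a table smaller than 1 GB the segment is always 0 and the block number can be used directly.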

#5 Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Karsten Desler (#3)
Re: postmaster segfault when using SELECT on a table

Karsten Desler wrote:

I don't know much about the postgres architecture and I don't know if bounds
checking on-disk values on a read makes a lot of sense since usually one
should be able to assume that there are no randomly flipped bits; but it
would've been nice to have a sensible log entry as to what really
happened.

I attached a patch backported from HEAD to 8.2. You can try it. It has a
small performance penalty, but it does not crash on corrupted data.

Zdenek

Attachments:

pg_lzcompress.patch (text/x-patch, +77 -77)

#6 Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#4)
Re: postmaster segfault when using SELECT on a table

Tom Lane wrote:

My suspicion though is that you'll find that a large portion of that
page is damaged; that's usually what we've seen in such cases in the
past.

I think it can happen only if the corrupted region is smaller than the TOAST
chunk size. Otherwise the page header, or the tuple header and chunk id,
would be corrupted too, and that should be reported in another place.

Zdenek

#7 Karsten Desler
kd@link11.de
In reply to: Zdenek Kotala (#5)
Re: postmaster segfault when using SELECT on a table

* Zdenek Kotala wrote:

Karsten Desler wrote:

I don't know much about the postgres architecture and I don't know if bounds
checking on-disk values on a read makes a lot of sense since usually one
should be able to assume that there are no randomly flipped bits; but it
would've been nice to have a sensible log entry as to what really
happened.

I attached a patch backported from HEAD to 8.2. You can try it. It has a
small performance penalty, but it does not crash on corrupted data.

Thank you very much! I have restored a backup of the corrupt postgres
data files on a second server and I can confirm that postmaster no
longer crashes with the patch applied.

Thanks,
Karsten