seg fault crashed the postmaster
Hi,
Running CentOS. We just upgraded our production db from 8.4.4 to 9.0.2 last
night. About 20 hours later, an update statement seg faulted and crashed
the server. This is a typical update that has worked fine for a long time.
20898 datafeed (58628) 2010-12-30 19:28:14 EST [103]LOG: process 20898 acquired ShareLock on transaction 1286738762 after 90174.969 ms
20898 datafeed (58628) 2010-12-30 19:28:14 EST [104]STATEMENT: update v_messages set status = 'S', updated_on = now() where id in (select id from v_messages where author_id = 34409854 and status != 'S' limit 10000)
5802 2010-12-30 19:28:14 EST [4]LOG: server process (PID 20898) was terminated by signal 11: Segmentation fault
5802 2010-12-30 19:28:14 EST [5]LOG: terminating any other active server processes
15426 pipeline (36834) 2010-12-30 19:28:14 EST [1]WARNING: terminating connection because of crash of another server process
15426 pipeline 10.10.11.54(36834) 2010-12-30 19:28:14 EST [2]DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
One hint that something might be wrong was this:
18235 datafeed (44228) 2010-12-30 18:16:37 EST [11]LOG: process 18235 acquired ShareLock on transaction 1285952031 after 29966.703 ms
18235 datafeed (44228) 2010-12-30 18:16:37 EST [12]STATEMENT: update v_messages set status = 'S', updated_on = now() where id in (select id from v_messages where author_id = 25301995 and status != 'S' limit 10000)
18235 datafeed (44228) 2010-12-30 18:16:43 EST [13]ERROR: compressed data is corrupt
18235 datafeed (44228) 2010-12-30 18:16:43 EST [14]STATEMENT: update v_messages set status = 'S', updated_on = now() where id in (select id from v_messages where author_id = 25301995 and status != 'S' limit 10000)
How concerned should I be? Thanks!
Gordon
Gordon Shannon <gordo169@gmail.com> writes:
Running CentOS. We just upgraded our production db from 8.4.4 to 9.0.2 last
night. About 20 hours later, an update statement seg faulted and crashed
the server. This is a typical update that has worked fine for a long time.
Could we see a stack trace from that? Or at least a self-contained
test case?
regards, tom lane
I'd love to send you a stack trace. Any suggestions on how to get one? It
has since happened again, on the same update command, so I'm guessing I can
repeat it.
Gordon Shannon <gordo169@gmail.com> writes:
I'd love to send you a stack trace. Any suggestions on how to get one? It
has since happened again, on the same update command, so I'm guessing I can
repeat it.
http://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend
regards, tom lane
Stack trace:
#0 0x00000031a147c15c in memcpy () from /lib64/libc.so.6
#1 0x0000000000450cb8 in __memcpy_ichk (tuple=0x7fffb29ac900) at
/usr/include/bits/string3.h:51
#2 heap_copytuple (tuple=0x7fffb29ac900) at heaptuple.c:592
#3 0x0000000000543d4c in EvalPlanQualFetchRowMarks (epqstate=0x3cd85ab8) at
execMain.c:1794
#4 0x00000000005440db in EvalPlanQual (estate=0x1e0db420,
epqstate=0x3cd85ab8, relation=<value optimized out>, rti=4,
tid=0x7fffb29aca20, priorXmax=<value optimized out>) at execMain.c:1401
#5 0x00000000005592eb in ExecUpdate (node=0x3cd85a28) at
nodeModifyTable.c:527
#6 ExecModifyTable (node=0x3cd85a28) at nodeModifyTable.c:748
#7 0x0000000000545953 in ExecProcNode (node=0x3cd85a28) at
execProcnode.c:359
#8 0x0000000000544881 in ExecutePlan (queryDesc=0x1e727990,
direction=1039265768, count=0) at execMain.c:1190
#9 standard_ExecutorRun (queryDesc=0x1e727990, direction=1039265768,
count=0) at execMain.c:280
#10 0x00002ab002c0f2b5 in explain_ExecutorRun (queryDesc=0x1e727990,
direction=ForwardScanDirection, count=0) at auto_explain.c:203
#11 0x00000000005f9c81 in ProcessQuery (plan=0x2112ad60,
sourceText=0x1b3e59e0 "update v_messages set status = 'S', updated_on =
now() where id in (select id from v_messages where author_id = 33138761 and
status != 'S' limit 10000)",
params=<value optimized out>, dest=0x2112ae40,
completionTag=0x7fffb29ace20 "") at pquery.c:197
#12 0x00000000005f9e99 in PortalRunMulti (portal=0x1b32aed0, isTopLevel=1
'\001', dest=0x2112ae40, altdest=0x2112ae40, completionTag=0x7fffb29ace20
"") at pquery.c:1268
#13 0x00000000005fa965 in PortalRun (portal=0x1b32aed0,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2112ae40,
altdest=0x2112ae40, completionTag=0x7fffb29ace20 "") at pquery.c:822
#14 0x00000000005f7455 in exec_simple_query (
query_string=0x1b3e59e0 "update v_messages set status = 'S', updated_on
= now() where id in (select id from v_messages where author_id = 33138761
and status != 'S' limit 10000)")
at postgres.c:1058
#15 0x00000000005f7d14 in PostgresMain (argc=<value optimized out>,
argv=<value optimized out>, username=<value optimized out>) at
postgres.c:3929
#16 0x00000000005c7ce5 in ServerLoop () at postmaster.c:3555
#17 0x00000000005c89ec in PostmasterMain (argc=5, argv=0x1b31ea00) at
postmaster.c:1092
#18 0x00000000005725fe in main (argc=5, argv=<value optimized out>) at
main.c:188
Gordon Shannon <gordo169@gmail.com> writes:
Stack trace:
Hmm. This suggests that there's something wrong in the EvalPlanQual
code, which gets invoked when there are concurrent updates to the same
row (ie, the row this UPDATE is trying to change is one that was changed
by some other transaction since the query started). That stuff got
rewritten rather thoroughly for 9.0, so the idea of a new bug there
isn't exactly surprising. But it's going to be hard to find without
a test case. Can you show us the full schema for this table and all
the queries that execute against it up till the point of the failure?
(Turning on log_statement across all sessions would help collect that
info, if you don't have it already.)
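To illustrate the pattern (just a toy sketch with a made-up one-row table, nothing like your real workload):

-- session 1
BEGIN;
UPDATE t SET status = 'S' WHERE id = 1;   -- takes the row lock

-- session 2 (READ COMMITTED): blocks on session 1's row lock
UPDATE t SET status = 'S' WHERE id = 1 AND status != 'S';

-- session 1
COMMIT;
-- session 2 now wakes up; EvalPlanQual re-fetches the updated row version
-- and re-checks the WHERE clause against it (here it would then skip the
-- row, since status is already 'S').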
regards, tom lane
Interesting. That's exactly what we have been doing -- trying to update the
same rows in multiple txns. For us to proceed in production, I will take
steps to ensure we stop doing that, as it's just an app bug really.
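(One direction for that, just a sketch: take a per-author advisory lock around each batch so the two threads can't collide -- pg_try_advisory_lock has been around since 8.2. Not something we've implemented, only the general idea:)

select pg_try_advisory_lock(34409854);   -- returns false if the other thread holds it; skip this author for now
update v_messages set status = 'S', updated_on = now()
 where id in (select id from v_messages
               where author_id = 34409854 and status != 'S' limit 10000);
select pg_advisory_unlock(34409854);     -- release once the batch is done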
The table in question -- v_messages -- is an empty base table with 76
partitions, with a total of 2.8 billion rows.
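(For a rough picture, it's the usual 8.x/9.0 inheritance-style layout; the sketch below uses invented child-table names and ranges and trims the columns to the ones visible in the queries -- the real DDL is in the parts.sql attached later.)

create table v_messages (
    id         bigint not null,
    author_id  bigint not null,
    status     message_status_enum not null,
    updated_on timestamptz
);
create table v_messages_p01 (
    check (id >= 0 and id < 40000000)
) inherits (v_messages);
-- ... 75 more child tables, plus trigger/rule logic to route inserts.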
Let me summarize what I see as the key facts here:
(All problems have come from the UPDATE query, all identical except for
different "author_id" values.)
1. We did a "link" upgrade Wed night, from 8.4.4 to 9.0.2, so the upgrade
happened in place; no data files were copied.
2. The 1st error was "compressed data is corrupt" at 18:16
3. We got 2 seg fault crashes before turning on cores and getting a 3rd
crash with the stack trace.
4. We then got an "invalid memory alloc request size 18446744073449177092"
at 23:50. This was an ERROR, not a crash.
At this point, is it your suspicion that there is a code bug in 9.0.2,
rather than corrupt data?
I will post the schema and then work on a test case.
-gordon
Here is the ddl for the tables in question. There are foreign keys to other
tables that I omitted.
http://postgresql.1045698.n5.nabble.com/file/n3323804/parts.sql parts.sql
Gordon Shannon <gordo169@gmail.com> writes:
Here is the ddl for the tables in question. There are foreign keys to other
tables that I omitted.
http://postgresql.1045698.n5.nabble.com/file/n3323804/parts.sql parts.sql
Hmmm ... what is "message_status_enum"? Is that an actual enum type, or
some kind of domain over text or varchar? If it's an enum, I'm a bit
baffled where the "corrupt compressed data" message could have come from.
regards, tom lane
Sorry, I left that out. Yeah, I wondered that too, since these tables do
not use toast.
CREATE TYPE message_status_enum AS ENUM ( 'V', 'X', 'S', 'R', 'U', 'D' );
Gordon Shannon <gordo169@gmail.com> writes:
Sorry, I left that out. Yeah, I wondered that too, since these tables do
not use toast.
Hm. Well, given that the stack trace suggests we're trying to access a
tuple value that's not there (bogus pointer, or data overwritten since
the pointer was created), the "invalid memory alloc request size"
message is believable as another manifestation of the same problem.
The odds seem pretty good that the "corrupt compressed data" message
has the same origin at bottom, although the lack of any obvious data
to be compressed in this table is confusing. Maybe you could get that
from trying to copy over a garbage value of that one varchar column,
though.
I tried to replicate the problem here without success, which suggests to
me that maybe it's plan-dependent. Can you show what EXPLAIN gives you
for the troublesome query?
Another point here is that your logs suggest that the query blocked
behind the previous updater of the row for a very long time --- 90 sec
in one case, 30 sec in another. Is that the typical runtime for one of
these updates? If not, can you identify what the blocking query was?
If it's something different from another instance of the identical
update, it'd be useful to know what.
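For reference, something along these lines should show who holds the transaction-id lock a waiter is stuck on (column names as of 9.0, i.e. procpid/current_query; the holder may just show "<IDLE> in transaction" if it's between statements):

select waiting.pid  as waiting_pid,
       holder.pid   as holding_pid,
       act.current_query as holding_query
  from pg_locks waiting
  join pg_locks holder on holder.transactionid = waiting.transactionid
                      and holder.granted
  join pg_stat_activity act on act.procpid = holder.pid
 where waiting.locktype = 'transactionid'
   and not waiting.granted;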
regards, tom lane
Yes that query does take 30 or 90 secs. I'm pretty sure it was blocking on
its twin update running concurrently. However I'm not really sure how to
identify what "transaction 1283585646" was.
Enclosed is the query plan -- 21000 lines
-gordon
Gordon Shannon <gordo169@gmail.com> writes:
Enclosed is the query plan -- 21000 lines
Um ... nope?
regards, tom lane
Maybe it doesn't work from gmail. I'll try uploading from here.
http://postgresql.1045698.n5.nabble.com/file/n3323933/plan.txt plan.txt
I wrote:
The odds seem pretty good that the "corrupt compressed data" message
has the same origin at bottom, although the lack of any obvious data
to be compressed in this table is confusing. Maybe you could get that
from trying to copy over a garbage value of that one varchar column,
though.
Nah, that's overthinking the problem. Just before the previously
identified crash point, we have
td = DatumGetHeapTupleHeader(datum);
That's a pg_detoast_datum call underneath the macro. So if you assume
that the tuple-datum pointer is pointing at garbage, you can get all
three of the observed symptoms very easily, depending on what the
garbage passing for its length word happens to look like.
So I'm pretty sure that what we're dealing with is a case of the plan
freeing a transient tuple datum sooner than it should. But the toy case
I tried here didn't fail, so evidently I'm not close enough to the plan
you're actually using. Still need to see that EXPLAIN output. It'd be
helpful also to know what nondefault configuration settings you're
using, especially work_mem and planner-related settings. Also, do you
have an idea of how many rows might've matched the WHERE conditions?
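(Even a rough count against one of the author_ids from the failing statements would give an order of magnitude, e.g. something like the following -- the rows will have moved on since the crash, of course:)

select count(*) from v_messages where author_id = 33138761 and status != 'S';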
regards, tom lane
The number of matching rows on these queries is anything from 0 to 10000. I
don't think I can tell how many would have matched on the ones that
crashed, although I suspect it would have been toward the 10000 end. I've
been trying to get a reproducible test case with no luck so far.
I assume you can now see the plan? I uploaded it twice, once via gmail and
once on Nabble.
Here are all the non-default settings:
listen_addresses = '*'
max_connections = 450
authentication_timeout = 20s
shared_buffers = 18GB
work_mem = 200MB
maintenance_work_mem = 8GB
shared_preload_libraries = 'auto_explain'
wal_level = hot_standby
synchronous_commit = off
wal_buffers = 8MB
commit_siblings = 1
checkpoint_segments = 256
checkpoint_warning = 5min
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/wal/%f.wrk; mv /var/lib/pgsql/wal/%f.wrk /var/lib/pgsql/wal/%f'
max_wal_senders = 1
cpu_tuple_cost = 0.003
cpu_index_tuple_cost = 0.001
cpu_operator_cost = 0.0005
effective_cache_size = 52GB
default_statistics_target = 200
log_directory = '/var/log/postgres'
log_filename = 'pg%d.log'
log_min_duration_statement = 7min
log_line_prefix = '%p %u %r %t [%l]'
log_lock_waits = on
log_temp_files = 0
track_functions = pl # none, pl, all
log_autovacuum_min_duration = 5min
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.03
autovacuum_freeze_max_age = 1500000000 # 1,500,000,000
autovacuum_vacuum_cost_delay = -1
temp_tablespaces = 'ts03,ts04,ts05,ts06,ts07,ts08,ts09,ts10,ts11,ts12,ts13,ts14,ts15,ts16,ts17,ts18,ts19,ts20,ts21,ts22,ts23,ts24,ts25,ts26,ts27,ts28,ts29,ts30,ts31,ts32,ts33,ts34,ts35,ts36,ts37,ts38'
vacuum_freeze_min_age = 500000000 # 500,000,000
vacuum_freeze_table_age = 1300000000 # 1,300,000,000
bytea_output = 'escape'
deadlock_timeout = 5s
standard_conforming_strings = on
custom_variable_classes = 'auto_explain'
auto_explain.log_min_duration = -1 # Remember this means for everybody!
auto_explain.log_analyze = off   ## DANGER! Don't set log_analyze to true unless you know what you're doing!
auto_explain.log_verbose = off
auto_explain.log_nested_statements = on
Gordon Shannon <gordo169@gmail.com> writes:
I assume you can now see the plan? I uploaded it twice, once via gmail and
once on Nabble.
Yeah, the Nabble one works. Now I'm even more confused, because the
whole-row var seems to be coming from the outside of the nestloop, which
is about the simplest possible case.
The number of matching rows on these queries is anything from 0 to 10000. I
don't think I can tell how many would have matched on the ones that
crashed, although I suspect it would have been toward the 10000 end. I've
been trying to get a reproducible test case with no luck so far.
No luck here either. A couple further questions --- do you think it's
likely that you could have been over 10000 matches, ie, the LIMIT
actually kicked in? Also, could there have been more than two
concurrent updates acting on the same rows?
regards, tom lane
Unfortunately it's now impossible to say how many were updated, as they get
deleted by another process later. I may be able to restore part of a dump
from 2 days ago on another machine, and get some counts from that, assuming
I have the disk space. I'll work on that.
I do not believe there could have been more than 2 concurrent updates
because the app is limited to 2 threads, but I will verify that with the
developer.
-gordon
I'm putting this on this thread, since it could be related to the issue.
I'm now seeing this in the log on the HSB/SR server. It's happened about 4
times in the past 2 days.
23964 2011-01-04 05:23:00 EST [47]LOG: invalid record length at 6E53/46E8A010
23535 2011-01-04 05:23:00 EST [2]FATAL: terminating walreceiver process due to administrator command
cp: cannot stat `/data23/wal_sync/0000000100006E5300000046': No such file or directory
23964 2011-01-04 05:23:00 EST [48]LOG: invalid record length at 6E53/46E8A010
cp: cannot stat `/data23/wal_sync/0000000100006E5300000046': No such file or directory
2486 2011-01-04 05:23:00 EST [1]LOG: streaming replication successfully connected to primary
23964 2011-01-04 15:47:59 EST [49]LOG: invalid record length at 6E6B/F8222010
2486 2011-01-04 15:47:59 EST [2]FATAL: terminating walreceiver process due to administrator command
cp: cannot stat `/data23/wal_sync/0000000100006E6B000000F8': No such file or directory
23964 2011-01-04 15:48:00 EST [50]LOG: invalid record length at 6E6B/F8222010
cp: cannot stat `/data23/wal_sync/0000000100006E6B000000F8': No such file or directory
The word FATAL sounds very ominous.
I'm tracking a problem with our tables being bloated, and was curious: if
someone regularly kills autovacuum jobs, will autovacuum later reattempt
to vacuum the table it was killed on?
I've made autovacuum more aggressive and given it more worker threads. Yet
for some reason we're not keeping up. The tables on this system are
very active (at least many of them are). We have a large database
(above 3.5 TB at this time) on a Sun Solaris Thumper (Sun Fire 4500
series, I believe).
Vacuum does complete when I run it manually on a table. But I suspect a
coworker working late at night may be killing autovacuum. I'm reading
through the logs right now to see if this is the case.
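Something like the following (assuming a reasonably recent 8.x/9.x, where pg_stat_activity still uses procpid/current_query) should also show whether autovacuum has gotten back to the bloated tables and whether any workers are running right now:

-- when autovacuum last finished on each table, and how much dead-tuple debt remains
select relname, last_autovacuum, last_vacuum, n_dead_tup
  from pg_stat_user_tables
 order by n_dead_tup desc
 limit 20;

-- autovacuum workers currently running
select procpid, query_start, current_query
  from pg_stat_activity
 where current_query like 'autovacuum:%';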
Thanks!