exec_execute_message crush

Started by Tatsuo Ishiiabout 16 years ago18 messages
#1Tatsuo Ishii
ishii@postgresql.org

While inspecting a complain from a pgpool user, I found that
PostgreSQL crushes with following statck trace:

#0 0x0826436a in list_length (l=0xaabe4e28)
at ../../../src/include/nodes/pg_list.h:94
#1 0x08262168 in IsTransactionStmtList (parseTrees=0xaabe4e28)
at postgres.c:2429
#2 0x0826132e in exec_execute_message (portal_name=0x857bab0 "", max_rows=0)
at postgres.c:1824
#3 0x08263b2a in PostgresMain (argc=4, argv=0x84f6c28,
username=0x84f6b08 "t-ishii") at postgres.c:3671
#4 0x0823299e in BackendRun (port=0x8511e68) at postmaster.c:3449
#5 0x08231f78 in BackendStartup (port=0x8511e68) at postmaster.c:3063
#6 0x0822f90a in ServerLoop () at postmaster.c:1387
#7 0x0822f131 in PostmasterMain (argc=3, argv=0x84f4bf8) at postmaster.c:1040
#8 0x081c6217 in main (argc=3, argv=0x84f4bf8) at main.c:188

This happens with following extended commands sequence:

parse
bind
describe
execute
<normaly done>
parse invalid SQL thus abort a transaction
bind (error)
describe (error)
execute (crush)

exec_execute_message crushes here:

/* Does the portal contain a transaction command? */
is_xact_command = IsTransactionStmtList(portal->stmts);

Looking into portal:

$5 = {name = 0x85727bc "", prepStmtName = 0x0, heap = 0x8596798, resowner = 0x0,
cleanup = 0, createSubid = 1,
sourceText = 0x859ac78 " SELECT NULL AS TABLE_CAT, n.nspname AS TABLE_SCHEM, ct.relname AS TABLE_NAME, a.attname AS COLUMN_NAME, a.attnum AS KEY_SEQ, ci.relname AS PK_NAME FROM pg_catalog.pg_namespace n, pg_catalog.pg_c"...,
commandTag = 0x84682aa "SELECT", stmts = 0xaabe4e28, cplan = 0x0,
portalParams = 0x0, strategy = PORTAL_ONE_SELECT, cursorOptions = 4,
status = PORTAL_READY, queryDesc = 0x0, tupDesc = 0x85db060,
formats = 0x859b0c8, holdStore = 0x0, holdContext = 0x0, atStart = 1 '\001',
atEnd = 1 '\001', posOverflow = 0 '\0', portalPos = 0,
creation_time = 315313855337710, visible = 1 '\001'}

Problem is, stmts points to invalid memory address:

(gdb) p *portal->stmts
Cannot access memory at address 0xaabe4e28

It seems the source of the problem is, exec_execute_message tries to
execute unamed portal which has unnamed statement which has already
gone.

Please note that without pgpool backend does not crush. This is
because JDBC driver does not do execute() if prior parse, bind
etc. failed, I think.

The crush happens PostgreSQL 8.3.8, 8.3.9 and 8.4.2.

Any thought?
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#1)
Re: exec_execute_message crush

Tatsuo Ishii <ishii@postgresql.org> writes:

It seems the source of the problem is, exec_execute_message tries to
execute unamed portal which has unnamed statement which has already
gone.

Could we see an actual test case?

regards, tom lane

#3Tatsuo Ishii
ishii@postgresql.org
In reply to: Tom Lane (#2)
Re: exec_execute_message crush

Tatsuo Ishii <ishii@postgresql.org> writes:

It seems the source of the problem is, exec_execute_message tries to
execute unamed portal which has unnamed statement which has already
gone.

Could we see an actual test case?

If you don't mind to use pgpool, it would be possible. If not, I need
to write a small program which handles frontend/backend protocol
directly. What shall I do?
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#3)
Re: exec_execute_message crush

Tatsuo Ishii <ishii@postgresql.org> writes:

Could we see an actual test case?

If you don't mind to use pgpool, it would be possible. If not, I need
to write a small program which handles frontend/backend protocol
directly. What shall I do?

Hm, can't you get libpq to do it?

regards, tom lane

#5Tatsuo Ishii
ishii@postgresql.org
In reply to: Tom Lane (#4)
Re: exec_execute_message crush

If you don't mind to use pgpool, it would be possible. If not, I need
to write a small program which handles frontend/backend protocol
directly. What shall I do?

Hm, can't you get libpq to do it?

That depends on how libpq is "intelligent":-) Let me try...

Another idea is a "packet recorder", which could record packets from
pgpool to PostgreSQL and replay them. I don't remember at present, but
I vaguely recall something like that exists.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#6Tatsuo Ishii
ishii@postgresql.org
In reply to: Tatsuo Ishii (#5)
Re: exec_execute_message crush

Hm, can't you get libpq to do it?

That depends on how libpq is "intelligent":-) Let me try...

Another idea is a "packet recorder", which could record packets from
pgpool to PostgreSQL and replay them. I don't remember at present, but
I vaguely recall something like that exists.

It seems we can't get libpq to do it. libpq does not provide a
function which can execute bind alone. In my understanding
PQexecPrepared does bind + execute.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#6)
Re: exec_execute_message crush

Tatsuo Ishii <ishii@postgresql.org> writes:

Hm, can't you get libpq to do it?

It seems we can't get libpq to do it. libpq does not provide a
function which can execute bind alone. In my understanding
PQexecPrepared does bind + execute.

The event sequence you mentioned had bind followed by execute, so
I'm not seeing the problem.

(In any case, some kind of quick lobotomy in libpq would be easier
than writing a standalone test program, no?)

regards, tom lane

#8Tatsuo Ishii
ishii@postgresql.org
In reply to: Tom Lane (#7)
Re: exec_execute_message crush

(In any case, some kind of quick lobotomy in libpq would be easier
than writing a standalone test program, no?)

Sounds nice idea.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#9Kris Jurka
books@ejurka.com
In reply to: Tatsuo Ishii (#1)
1 attachment(s)
Re: exec_execute_message crush

On Tue, 29 Dec 2009, Tatsuo Ishii wrote:

parse
bind
describe
execute
<normaly done>
parse invalid SQL thus abort a transaction
bind (error)
describe (error)
execute (crush)

Please note that without pgpool backend does not crush. This is
because JDBC driver does not do execute() if prior parse, bind
etc. failed, I think.

The JDBC driver will fire away parse, bind, and execute all at once before
a sync, to avoid network roundtrips, so your assumption of what's going on
here without pgpool doesn't seem accurate. Attached is a test case that
tries to duplicate what you've described and it errors out normally.
Below is the JDBC driver's protocol level logging.

21:41:39.407 (1) FE=> Parse(stmt=S_1,query="BEGIN",oids={})
21:41:39.407 (1) FE=> Bind(stmt=S_1,portal=null)
21:41:39.407 (1) FE=> Execute(portal=null,limit=0)
21:41:39.408 (1) FE=> Parse(stmt=null,query="SELECT $1 ",oids={23})
21:41:39.408 (1) FE=> Bind(stmt=null,portal=null,$1=<'1'>)
21:41:39.408 (1) FE=> Describe(portal=null)
21:41:39.408 (1) FE=> Execute(portal=null,limit=0)
21:41:39.408 (1) FE=> Parse(stmt=null,query=" SELECT SELECT $1
",oids={23})
21:41:39.408 (1) FE=> Bind(stmt=null,portal=null,$1=<'2'>)
21:41:39.409 (1) FE=> Describe(portal=null)
21:41:39.409 (1) FE=> Execute(portal=null,limit=0)
21:41:39.409 (1) FE=> Sync
21:41:39.443 (1) <=BE ParseComplete [S_1]
21:41:39.443 (1) <=BE BindComplete [null]
21:41:39.443 (1) <=BE CommandStatus(BEGIN)
21:41:39.443 (1) <=BE ParseComplete [null]
21:41:39.443 (1) <=BE BindComplete [null]
21:41:39.444 (1) <=BE RowDescription(1)
21:41:39.444 (1) <=BE DataRow
21:41:39.444 (1) <=BE CommandStatus(SELECT)
21:41:39.454 (1) <=BE ErrorMessage(ERROR: syntax error at or near
"SELECT"
Position: 9)

So this shows everything working as expected. Perhaps enabling this
logging on your JDBC client would show more clearly what it is trying to
do.

Kris Jurka

Attachments:

Crash.javatext/plain; CHARSET=US-ASCII; NAME=Crash.javaDownload
#10Tatsuo Ishii
ishii@postgresql.org
In reply to: Kris Jurka (#9)
Re: exec_execute_message crush

parse
bind
describe
execute
<normaly done>
parse invalid SQL thus abort a transaction
bind (error)
describe (error)
execute (crush)

Please note that without pgpool backend does not crush. This is
because JDBC driver does not do execute() if prior parse, bind
etc. failed, I think.

The JDBC driver will fire away parse, bind, and execute all at once before
a sync, to avoid network roundtrips, so your assumption of what's going on
here without pgpool doesn't seem accurate. Attached is a test case that
tries to duplicate what you've described and it errors out normally.
Below is the JDBC driver's protocol level logging.

21:41:39.407 (1) FE=> Parse(stmt=S_1,query="BEGIN",oids={})
21:41:39.407 (1) FE=> Bind(stmt=S_1,portal=null)
21:41:39.407 (1) FE=> Execute(portal=null,limit=0)
21:41:39.408 (1) FE=> Parse(stmt=null,query="SELECT $1 ",oids={23})
21:41:39.408 (1) FE=> Bind(stmt=null,portal=null,$1=<'1'>)
21:41:39.408 (1) FE=> Describe(portal=null)
21:41:39.408 (1) FE=> Execute(portal=null,limit=0)
21:41:39.408 (1) FE=> Parse(stmt=null,query=" SELECT SELECT $1
",oids={23})
21:41:39.408 (1) FE=> Bind(stmt=null,portal=null,$1=<'2'>)
21:41:39.409 (1) FE=> Describe(portal=null)
21:41:39.409 (1) FE=> Execute(portal=null,limit=0)
21:41:39.409 (1) FE=> Sync
21:41:39.443 (1) <=BE ParseComplete [S_1]
21:41:39.443 (1) <=BE BindComplete [null]
21:41:39.443 (1) <=BE CommandStatus(BEGIN)
21:41:39.443 (1) <=BE ParseComplete [null]
21:41:39.443 (1) <=BE BindComplete [null]
21:41:39.444 (1) <=BE RowDescription(1)
21:41:39.444 (1) <=BE DataRow
21:41:39.444 (1) <=BE CommandStatus(SELECT)
21:41:39.454 (1) <=BE ErrorMessage(ERROR: syntax error at or near
"SELECT"
Position: 9)

So this shows everything working as expected. Perhaps enabling this
logging on your JDBC client would show more clearly what it is trying to
do.

Thanks for clarification. I will look into more between pgpool and
PostgreSQL packet exchange.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#11Tatsuo Ishii
ishii@postgresql.org
In reply to: Tatsuo Ishii (#1)
Re: exec_execute_message crash

While inspecting a complain from a pgpool user, I found that
PostgreSQL crushes with following statck trace:

#0 0x0826436a in list_length (l=0xaabe4e28)
at ../../../src/include/nodes/pg_list.h:94
#1 0x08262168 in IsTransactionStmtList (parseTrees=0xaabe4e28)
at postgres.c:2429
#2 0x0826132e in exec_execute_message (portal_name=0x857bab0 "", max_rows=0)
at postgres.c:1824
#3 0x08263b2a in PostgresMain (argc=4, argv=0x84f6c28,
username=0x84f6b08 "t-ishii") at postgres.c:3671
#4 0x0823299e in BackendRun (port=0x8511e68) at postmaster.c:3449
#5 0x08231f78 in BackendStartup (port=0x8511e68) at postmaster.c:3063
#6 0x0822f90a in ServerLoop () at postmaster.c:1387
#7 0x0822f131 in PostmasterMain (argc=3, argv=0x84f4bf8) at postmaster.c:1040
#8 0x081c6217 in main (argc=3, argv=0x84f4bf8) at main.c:188

Ok, I think I understand what's going on.

parse
bind
describe
execute

This sequence of commands create cached plan in unnamed portal.

$5 = {name = 0x8574de4 "", prepStmtName = 0x0, heap = 0x8598400,
resowner = 0x8598488, cleanup = 0x81632ca <PortalCleanup>, createSubid = 1,
sourceText = 0x85ab818 " SELECT <omitted>"...,
commandTag = 0x84682ca "SELECT", stmts = 0xaabf43b0, cplan = 0xaabf4950,
portalParams = 0x0, strategy = PORTAL_ONE_SELECT, cursorOptions = 4,
status = PORTAL_READY, queryDesc = 0x85abc20, tupDesc = 0x85ddcb0,
formats = 0x85abc68, holdStore = 0x0, holdContext = 0x0, atStart = 1 '\001',
atEnd = 1 '\001', posOverflow = 0 '\0', portalPos = 0,
creation_time = 315487957498169, visible = 1 '\001'}

The cached plan(portal->cplan) and statements(portal->stmts) are
created by exec_bind_message():

/*
* Revalidate the cached plan; this may result in replanning. Any
* cruft will be generated in MessageContext. The plan refcount will
* be assigned to the Portal, so it will be released at portal
* destruction.
*/
cplan = RevalidateCachedPlan(psrc, false);
plan_list = cplan->stmt_list;

Please note that cplan and stmts belong to the same memory context.

Then following commands are coming:

parse invalid SQL thus abort a transaction
bind (error)
describe (error)
execute (crash)

parse causes transaction to abort, which causes call to
AbortCurrentTransaction->AbortTransaction->AtAbort_portals->ReleaseCachedPlan. It
calls ReleaseCachePlan(portal->cplan). ReleaseCachePlan calls
MemoryContextDelete(plan->context) which destroys both portal->cplan
and portal->stmts.

That was the reason why I had segfault by accessing portal->stmts.

To fix this I think exec_execute_message should throw an error if
portal->cleanup is NULL, since portal->cleanup is NULLed by
AtAbort_Portals at transaction abort (or portal is dropped).

Here is a suggested fix:

diff -c postgres.c~ postgres.c
*** postgres.c~	2009-06-18 19:08:08.000000000 +0900
--- postgres.c	2009-12-30 21:34:49.000000000 +0900
***************
*** 1804,1810 ****
  		dest = DestRemoteExecute;
  	portal = GetPortalByName(portal_name);
! 	if (!PortalIsValid(portal))
  		ereport(ERROR,
  				(errcode(ERRCODE_UNDEFINED_CURSOR),
  				 errmsg("portal \"%s\" does not exist", portal_name)));
--- 1804,1810 ----
  		dest = DestRemoteExecute;

portal = GetPortalByName(portal_name);
! if (!PortalIsValid(portal) || (PortalIsValid(portal) && portal->cleanup == NULL))
ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_CURSOR),
errmsg("portal \"%s\" does not exist", portal_name)));

--
Tatsuo Ishii
SRA OSS, Inc. Japan

#12Andrew Dunstan
andrew@dunslane.net
In reply to: Tatsuo Ishii (#11)
Re: exec_execute_message crash

Tatsuo Ishii wrote:

! if (!PortalIsValid(portal) || (PortalIsValid(portal) && portal->cleanup == NULL))

Surely the second call to PortalIsValid() is redundant.

if (( !PortalIsValid(portal)) || portal->cleanup == NULL)

should do it, no?

cheers

andrew

#13Tatsuo Ishii
ishii@postgresql.org
In reply to: Andrew Dunstan (#12)
Re: exec_execute_message crash

Tatsuo Ishii wrote:

! if (!PortalIsValid(portal) || (PortalIsValid(portal) && portal->cleanup == NULL))

Surely the second call to PortalIsValid() is redundant.

if (( !PortalIsValid(portal)) || portal->cleanup == NULL)

should do it, no?

Oops. You are right.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#11)
Re: exec_execute_message crash

Tatsuo Ishii <ishii@postgresql.org> writes:

parse causes transaction to abort, which causes call to
AbortCurrentTransaction->AbortTransaction->AtAbort_portals->ReleaseCachedPlan. It
calls ReleaseCachePlan(portal->cplan). ReleaseCachePlan calls
MemoryContextDelete(plan->context) which destroys both portal->cplan
and portal->stmts.

That was the reason why I had segfault by accessing portal->stmts.

To fix this I think exec_execute_message should throw an error if
portal->cleanup is NULL, since portal->cleanup is NULLed by
AtAbort_Portals at transaction abort (or portal is dropped).

This is just a kluge, and a rather bad one I think. The real problem
here is that AtAbort_Portals destroys the portal contents and doesn't
do anything to record the fact. It should probably be putting the
portal into PORTAL_FAILED state, and what exec_execute_message ought
to be doing is checking for that. It might be a good idea to explicitly
zero out the now-dangling pointers in the Portal struct, too.

It'd be nice to have a test case for this, hint hint ...

regards, tom lane

#15Tatsuo Ishii
ishii@postgresql.org
In reply to: Tom Lane (#14)
Re: exec_execute_message crash

This is just a kluge, and a rather bad one I think. The real problem
here is that AtAbort_Portals destroys the portal contents and doesn't
do anything to record the fact. It should probably be putting the
portal into PORTAL_FAILED state, and what exec_execute_message ought
to be doing is checking for that.

Yeah I thought about that too. in AtAbort_Portals:

--------------------------------------------------------------------------
/*
* Abort processing for portals.
*
* At this point we reset "active" status and run the cleanup hook if
* present, but we can't release the portal's memory until the cleanup call.
*
* The reason we need to reset active is so that we can replace the unnamed
* portal, else we'll fail to execute ROLLBACK when it arrives.
*/
void
AtAbort_Portals(void)
{
HASH_SEQ_STATUS status;
PortalHashEnt *hentry;

hash_seq_init(&status, PortalHashTable);

while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
{
Portal portal = hentry->portal;

if (portal->status == PORTAL_ACTIVE)
portal->status = PORTAL_FAILED;
--------------------------------------------------------------------------

Should I change the last if clause to?

if (portal->status == PORTAL_ACTIVE || portal->status == PORTAL_READY)
portal->status = PORTAL_FAILED;

zero out the now-dangling pointers in the Portal struct, too.

portal->cplan is already zero out by PortalReleaseCachedPlan. Problem
is, portal->stmts may belong to PortalContext or others (in this
particluar case). So if we want to zero out portal->stmts, we need to
memorize the memory context which it belongs to and we need add a new
struct member to portal. I'm afraid this is an overkill...

It'd be nice to have a test case for this, hint hint ...

Still working on...
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#16Tatsuo Ishii
ishii@postgresql.org
In reply to: Tatsuo Ishii (#15)
2 attachment(s)
Re: exec_execute_message crash

It'd be nice to have a test case for this, hint hint ...

Still working on...

Done. Inclded are C test program along with modified fe-exec.c.

The modification made to fe-exec.c is sending Sync after Parse, Bind
and Describe. Pgpool-II does this in order to get current transaction
status.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

Attachments:

test2.ctext/plain; charset=us-asciiDownload
fe-exec.ctext/plain; charset=us-asciiDownload
#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#16)
Re: exec_execute_message crash

Tatsuo Ishii <ishii@postgresql.org> writes:

Done. Inclded are C test program along with modified fe-exec.c.

The modification made to fe-exec.c is sending Sync after Parse, Bind
and Describe. Pgpool-II does this in order to get current transaction
status.

I tried this but didn't have any luck crashing the backend. libpq gets
tremendously confused by the extra ReadyForQuery responses, which is
unsurprising. The postmaster log shows

LOG: could not send data to client: Broken pipe
ERROR: relation "foo" does not exist at character 15
STATEMENT: SELECT * FROM foo
ERROR: unnamed prepared statement does not exist
ERROR: current transaction is aborted, commands ignored until end of transaction block
ERROR: current transaction is aborted, commands ignored until end of transaction block
STATEMENT: SELECT NULL , n.nspname, ct.relname, a.attname, a.attnum, ci.relname FROM pg_catalog.pg_namespace n, pg_catalog.pg_class ct, pg_catalog.pg_class ci, pg_catalog.pg_attribute a, pg_catalog.pg_index i WHERE ct.oid=i.indrelid AND ci.oid=i.indexrelid AND a.attrelid=ci.oid AND i.indisprimary AND ct.relname = 'mst_Ucompany_feature_setting' AND ct.relnamespace = n.oid AND n.nspname = 'foo' ORDER BY 1, 2, 3

So the "unnamed prepared statement does not exist" bit seems to be
related to what you are talking about, but it doesn't actually fail.

regards, tom lane

#18Tatsuo Ishii
ishii@postgresql.org
In reply to: Tom Lane (#17)
Re: exec_execute_message crash

I tried this but didn't have any luck crashing the backend. libpq gets
tremendously confused by the extra ReadyForQuery responses, which is
unsurprising. The postmaster log shows

LOG: could not send data to client: Broken pipe
ERROR: relation "foo" does not exist at character 15
STATEMENT: SELECT * FROM foo
ERROR: unnamed prepared statement does not exist
ERROR: current transaction is aborted, commands ignored until end of transaction block
ERROR: current transaction is aborted, commands ignored until end of transaction block
STATEMENT: SELECT NULL , n.nspname, ct.relname, a.attname, a.attnum, ci.relname FROM pg_catalog.pg_namespace n, pg_catalog.pg_class ct, pg_catalog.pg_class ci, pg_catalog.pg_attribute a, pg_catalog.pg_index i WHERE ct.oid=i.indrelid AND ci.oid=i.indexrelid AND a.attrelid=ci.oid AND i.indisprimary AND ct.relname = 'mst_Ucompany_feature_setting' AND ct.relnamespace = n.oid AND n.nspname = 'foo' ORDER BY 1, 2, 3

So the "unnamed prepared statement does not exist" bit seems to be
related to what you are talking about, but it doesn't actually fail.

I have put some debugging codes to make sure that portal->cplan and
portal->stmts belong to the same memory context by calling
GetMemoryChunkContext and surely they did. It appears that the memory
was surely deleted by MemeoryContextDelete in ReleaseCachedPlan. Also
I defined CLOBBER_FREED_MEMORY in aset.c to fill with 0x7f the freed
memory. Strange thing was portal->smts was not clobbered by 0x7f.
It seems I have missed something here...
--
Tatsuo Ishii
SRA OSS, Inc. Japan