Segfault in backend CTE code

Started by Phil Sorberabout 14 years ago9 messagesbugs
Jump to latest
#1Phil Sorber
phil@omniti.com

Running Postgres 9.1.2.

I've attached a backtrace. Looking at the backtrace it looks like
ExecGetResultType() gets called with a NULL planstate and causes the
segmentation fault:

https://github.com/postgres/postgres/blob/master/src/backend/executor/execUtils.c#L470

Following the stack I see that an optimization for writeable CTE's
inserts a NULL subplanstate:

https://github.com/postgres/postgres/blob/master/src/backend/executor/execMain.c#L2344

ExecInitCteScan() is what eventually passes it to ExecGetResultType():

https://github.com/postgres/postgres/blob/master/src/backend/executor/nodeCtescan.c#L255

I've also attached a proposed fix. In this optimized case it says that
we won't ever use the subplan anyway, so I figured that not setting
the scan tuple type won't matter. I also added an Assert() to
ExecGetResultType(). I modified the declaration of 'slot' to remove a
compiler warning. This patch is against master but should backport to
9.1 cleanly. It also passed all regression tests. If you end up using
this patch please also credit Rick Pufky who helped me with this.

Attachments:

backtrace.txttext/plain; charset=US-ASCII; name=backtrace.txtDownload
fix_CTE_NULL_PTR_deref.patchtext/x-patch; charset=US-ASCII; name=fix_CTE_NULL_PTR_deref.patchDownload+10-8
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Phil Sorber (#1)
Re: Segfault in backend CTE code

Phil Sorber <phil@omniti.com> writes:

I've attached a backtrace.

How about a test case? I have no faith in the proposed patch without
some evidence of what it's supposed to fix.

regards, tom lane

#3Phil Sorber
phil@omniti.com
In reply to: Tom Lane (#2)
Re: Segfault in backend CTE code

On Tue, Jan 24, 2012 at 12:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Phil Sorber <phil@omniti.com> writes:

I've attached a backtrace.

How about a test case?  I have no faith in the proposed patch without
some evidence of what it's supposed to fix.

We are having trouble coming up with a test case that can reliably
reproduce this. I has happened 5 times since Friday. There was another
one this morning and we have more logging turned on now and I have
attached more information. Is there anything else we can do to help
debug it? This is a production machine and we may be forced into
applying the patch we came up with, but we wanted to have as good of a
solution as possible.

                       regards, tom lane

I have a new backtrace which is almost identical to the old one. Only
memory addresses differ. I am attaching it anyway. I am also attaching
the pertinent logs with sensitive information changed. The log lines
are just the entries for the PID that segfaulted from right before it
happened. The PID was long lived but nothing was going on for minutes
of time before the segfault. I am also attaching the source of the
function that was called right before the segfault. Again with
sensitive information changed.

Thanks.

Attachments:

backtrace2.txttext/plain; charset=US-ASCII; name=backtrace2.txtDownload
segfault_logs.txttext/plain; charset=US-ASCII; name=segfault_logs.txtDownload
segfault_func.txttext/plain; charset=US-ASCII; name=segfault_func.txtDownload
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Phil Sorber (#3)
Re: Segfault in backend CTE code

Phil Sorber <phil@omniti.com> writes:

On Tue, Jan 24, 2012 at 12:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

How about a test case?

We are having trouble coming up with a test case that can reliably
reproduce this.

The stack traces run through the EvalPlanQual machinery, which is only
going to be entered when attempting to update/delete a row that's been
updated by a concurrent transaction. So what I'd suggest for creating a
test case is

(1) in a psql session, do
BEGIN;
UPDATE some-target-row;

(2) in another psql session, call this function with arguments
that will make it try to update that same row; it should
block.

(3) in the first session, COMMIT to unblock.

FWIW, having re-examined your patch with some caffeine in me, I don't
think it's right at all. You can't just blow off setting the scan type
for a CTEScan node. What it looks like to me is that the EvalPlanQual
code is not initializing the new execution tree correctly; but that
idea would be a lot easier to check into with a test case.

regards, tom lane

#5Phil Sorber
phil@omniti.com
In reply to: Tom Lane (#4)
Re: Segfault in backend CTE code

On Tue, Jan 24, 2012 at 4:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Phil Sorber <phil@omniti.com> writes:

On Tue, Jan 24, 2012 at 12:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

How about a test case?

We are having trouble coming up with a test case that can reliably
reproduce this.

The stack traces run through the EvalPlanQual machinery, which is only
going to be entered when attempting to update/delete a row that's been
updated by a concurrent transaction.  So what I'd suggest for creating a
test case is

       (1) in a psql session, do
               BEGIN;
               UPDATE some-target-row;

       (2) in another psql session, call this function with arguments
           that will make it try to update that same row; it should
           block.

       (3) in the first session, COMMIT to unblock.

That helped a lot. I now have a simple test case that I can reliably
re-produce the segfault and now also a patch that fixes it. I had to
modify the patch slightly because while it fixed the first problem, it
just cascaded to another NULL deref from the same root cause. Both are
attached.

FWIW, having re-examined your patch with some caffeine in me, I don't
think it's right at all.  You can't just blow off setting the scan type
for a CTEScan node.  What it looks like to me is that the EvalPlanQual
code is not initializing the new execution tree correctly; but that
idea would be a lot easier to check into with a test case.

If I understand what you are saying, then I agree that is the root of
the problem. The comment label's it as an optimization, but then later
fails to account for all the changes needed. My patch accounts for at
least one extra change that is needed. We could also remove the
"optimization" but I assumed it was there for a reason, especially
given the fact that someone took the time to make a comment about it.

The change was made in this commit by you:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;f=src/backend/executor/execMain.c;h=389af951552ff2209eae3e62fa147fef12329d4f

Show quoted text

                       regards, tom lane

Attachments:

CTE_segfault_test_case.txttext/plain; charset=US-ASCII; name=CTE_segfault_test_case.txtDownload
fix_CTE_NULL_PTR_deref_v2.patchtext/x-patch; charset=US-ASCII; name=fix_CTE_NULL_PTR_deref_v2.patchDownload+30-24
#6Phil Sorber
phil@omniti.com
In reply to: Phil Sorber (#5)
Re: Segfault in backend CTE code

I screwed up cut and paste when putting the test case together. The
reference to table user_data should be t1.

Show quoted text

On Wed, Jan 25, 2012 at 12:47 PM, Phil Sorber <phil@omniti.com> wrote:

On Tue, Jan 24, 2012 at 4:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Phil Sorber <phil@omniti.com> writes:

On Tue, Jan 24, 2012 at 12:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

How about a test case?

We are having trouble coming up with a test case that can reliably
reproduce this.

The stack traces run through the EvalPlanQual machinery, which is only
going to be entered when attempting to update/delete a row that's been
updated by a concurrent transaction.  So what I'd suggest for creating a
test case is

       (1) in a psql session, do
               BEGIN;
               UPDATE some-target-row;

       (2) in another psql session, call this function with arguments
           that will make it try to update that same row; it should
           block.

       (3) in the first session, COMMIT to unblock.

That helped a lot. I now have a simple test case that I can reliably
re-produce the segfault and now also a patch that fixes it. I had to
modify the patch slightly because while it fixed the first problem, it
just cascaded to another NULL deref from the same root cause. Both are
attached.

FWIW, having re-examined your patch with some caffeine in me, I don't
think it's right at all.  You can't just blow off setting the scan type
for a CTEScan node.  What it looks like to me is that the EvalPlanQual
code is not initializing the new execution tree correctly; but that
idea would be a lot easier to check into with a test case.

If I understand what you are saying, then I agree that is the root of
the problem. The comment label's it as an optimization, but then later
fails to account for all the changes needed. My patch accounts for at
least one extra change that is needed. We could also remove the
"optimization" but I assumed it was there for a reason, especially
given the fact that someone took the time to make a comment about it.

The change was made in this commit by you:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;f=src/backend/executor/execMain.c;h=389af951552ff2209eae3e62fa147fef12329d4f

                       regards, tom lane

Attachments:

CTE_segfault_test_case_v2.txttext/plain; charset=US-ASCII; name=CTE_segfault_test_case_v2.txtDownload
#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Phil Sorber (#5)
Re: Segfault in backend CTE code

Phil Sorber <phil@omniti.com> writes:

That helped a lot. I now have a simple test case that I can reliably
re-produce the segfault and now also a patch that fixes it.

[ pokes at that... ] Hmm. I think what this proves is that that
optimization in EvalPlanQualStart is just too cute for its own good,
and that the only safe fix is to take it out. That code path is
(obviously) none too well tested, so we should not have it setting
up execution tree structures that are not like those used normally.
It's just begging for trouble.

regards, tom lane

#8Phil Sorber
phil@omniti.com
In reply to: Tom Lane (#7)
Re: Segfault in backend CTE code

On Wed, Jan 25, 2012 at 5:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Phil Sorber <phil@omniti.com> writes:

That helped a lot. I now have a simple test case that I can reliably
re-produce the segfault and now also a patch that fixes it.

[ pokes at that... ]  Hmm.  I think what this proves is that that
optimization in EvalPlanQualStart is just too cute for its own good,
and that the only safe fix is to take it out.  That code path is
(obviously) none too well tested, so we should not have it setting
up execution tree structures that are not like those used normally.
It's just begging for trouble.

I played around with removing the optimization, but there are other
pieces further down the line that are upset but having a ModifyTable
node in the execution tree. Specifically this:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/executor/nodeModifyTable.c;h=cf32dc569037f710ce6c43c4c93ee3a10cabe085;hb=389af951552ff2209eae3e62fa147fef12329d4f#l900

Not sure at all how to proceed from there so I am deferring. Perhaps
the original authors can be asked to weigh in? We changed our writable
CTE queries to update/insert loops so this is no longer a blocker for
us.

Show quoted text

                       regards, tom lane

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Phil Sorber (#8)
Re: Segfault in backend CTE code

Phil Sorber <phil@omniti.com> writes:

On Wed, Jan 25, 2012 at 5:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I played around with removing the optimization, but there are other
pieces further down the line that are upset but having a ModifyTable
node in the execution tree.

Hm, yeah, obviously this scenario has never been tested :-(. I have
applied a patch for it, and also done some work so that it will get
tested by the buildfarm in future. Thanks for the report!

We changed our writable CTE queries to update/insert loops so this is
no longer a blocker for us.

FWIW, that technique didn't really work anyway, as even with the patch
I observe that you get a "duplicate unique key" failure if two such
commands try to insert a new row concurrently. This is because neither
UPDATE subquery will see an as-yet-uncommitted inserted row.

regards, tom lane