Segfault on exclusion constraint violation

Started by Heikki Linnakangas about 11 years ago | 6 messages | bugs
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

9.4 and master segfault if an insertion needs to wait for another
transaction to finish because of an exclusion constraint. To reproduce:

Run these in session A:

create extension btree_gist;
create table foo (i int4, constraint i_exclude exclude using gist (i with =));
begin; insert into foo values (1);

Leave the transaction open, then run this in session B:

insert into foo values (1);

LOG: server process (PID 3690) was terminated by signal 11:
Segmentation fault
DETAIL: Failed process was running: insert into foo values (1);
LOG: terminating any other active server processes

gdb backtrace:

#0 0x000000000078520d in XactLockTableWait (xid=705, rel=0x7f2f6e835728,
ctid=0x7f7f7f7f7f7f7f8b, oper=XLTW_RecheckExclusionConstr) at
lmgr.c:515
#1 0x000000000064bd86 in check_exclusion_constraint (heap=0x7f2f6e835728,
index=0x7f2f6e837620, indexInfo=0x22187c0, tupleid=0x2219514,
values=0x7fffae880a10, isnull=0x7fffae8809f0 "", estate=0x2218228,
newIndex=0 '\000', errorOK=0 '\000') at execUtils.c:1310
#2 0x000000000064b9a9 in ExecInsertIndexTuples (slot=0x2218500,
tupleid=0x2219514, estate=0x2218228) at execUtils.c:1126
#3 0x000000000065f8c4 in ExecInsert (slot=0x2218500, planSlot=0x2218500,
estate=0x2218228, canSetTag=1 '\001') at nodeModifyTable.c:274

This only happens with assertions enabled. The culprit is commit
f88d4cfc9d417dac2ee41a8f5e593898e56fd2bd, which added the 'ctid'
argument to XactLockTableWait. check_exclusion_constraint calls
index_endscan() just before XactLockTableWait, but that frees the
memory the ctid points to.

The fix for this particular instance is trivial: copy the ctid to a
local variable before calling index_endscan. However, looking at the
other XactLockTableWait() and MultiXactIdWait() calls, there are more
questionable pointers being passed. Most point to heap tuples on disk
pages, after releasing the lock on the page, although not the pin. The
one in EvalPlanQualFetch releases the pin too.
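
As an illustration of the fix described above (not the committed patch), here is a minimal, self-contained C sketch of the pattern: copy the TID by value before ending the scan that owns it, then hand the wait routine the address of the local copy. DemoTid, DemoScan, and the demo_* functions are hypothetical stand-ins invented for this sketch; they are not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tuple identifier (block number, offset). */
typedef struct {
    uint32_t block;
    uint16_t offset;
} DemoTid;

/* Hypothetical stand-in for an index scan that owns the current TID. */
typedef struct {
    DemoTid current;
} DemoScan;

static DemoScan *demo_beginscan(uint32_t block, uint16_t offset)
{
    DemoScan *scan = malloc(sizeof(*scan));

    if (scan == NULL)
        abort();
    scan->current.block = block;
    scan->current.offset = offset;
    return scan;
}

/* Plays the role of index_endscan(): frees the memory the TID lives in. */
static void demo_endscan(DemoScan *scan)
{
    free(scan);
}

/* Plays the role of the wait call: it only reads the TID it is given. */
static uint32_t demo_wait(const DemoTid *tid)
{
    assert(tid != NULL);
    return tid->block * 1000u + tid->offset;
}

/* Safe ordering: copy by value, end the scan, then wait on the copy. */
uint32_t demo_check_conflict(uint32_t block, uint16_t offset)
{
    DemoScan *scan = demo_beginscan(block, offset);
    DemoTid   tid_copy = scan->current;   /* struct copy onto the stack */

    demo_endscan(scan);                   /* &scan->current now dangles */
    return demo_wait(&tid_copy);          /* reads only the local copy */
}
```

Passing &scan->current to demo_wait() after demo_endscan() would be the bug from this report. Assert-enabled PostgreSQL builds overwrite freed memory with 0x7f bytes, which is consistent with the ctid=0x7f7f7f7f7f7f7f8b pointer in the backtrace and with the crash showing up only under assertions.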

I'll write up a patch to change those call sites to use local variables.
Hopefully it's trivial enough to still include in 9.4.1, although time
is really running out..

- Heikki

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#1)
Re: Segfault on exclusion constraint violation

On 02/02/2015 03:50 PM, Heikki Linnakangas wrote:

This only happens with assertions enabled. The culprit is commit
f88d4cfc9d417dac2ee41a8f5e593898e56fd2bd, which added the 'ctid'
argument to XactLockTableWait. check_exclusion_constraint calls
index_endscan() just before XactLockTableWait, but that free's the
memory the ctid points to.

The fix for this particular instance is trivial: copy the ctid to a
local variable before calling index_endscan. However, looking at the
other XactLockTableWait() and MultiXactIdWait() calls, there are more
questionable pointers being passed. Most point to heap tuples on disk
pages, after releasing the lock on the page, although not the pin. The
one in EvalPlanQualFetch releases the pin too.

I'll write up a patch to change those call sites to use local variables.
Hopefully it's trivial enough to still include in 9.4.1, although time
is really running out..

I'll commit the attached fix shortly, so please shout quickly if you see
a problem with this.

Aside from the potential for segfaults with assertions, I think the
calls passed an incorrect ctid anyway. For example:

--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -1307,7 +1307,7 @@ retry:
                if (TransactionIdIsValid(xwait))
                {
                        index_endscan(index_scan);
-                       XactLockTableWait(xwait, heap, &tup->t_data->t_ctid,
+                       XactLockTableWait(xwait, heap, &tup->t_self,
                                          XLTW_RecheckExclusionConstr);
                        goto retry;
                }

We don't really want to pass the heap tuple's ctid field. If the tuple
was updated (again) in the same transaction, the one that's still
in-progress, that points to the *next* tuple in the chain, but the error
message says "while checking exclusion constraint on tuple (%u,%u) in
relation \"%s\"". We should be passing the TID of the tuple itself, not
the ctid value in the tuple's header. The attached patch fixes that too.
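
To make the t_self versus t_ctid distinction concrete, here is a small C sketch of a two-version update chain. The structs are deliberately simplified and hypothetical: in PostgreSQL the ctid lives in the tuple header (tup->t_data->t_ctid), while VersionTid, VersionedTuple, and the demo_* functions exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified tuple identifier (block number, offset). */
typedef struct {
    uint32_t block;
    uint16_t offset;
} VersionTid;

/*
 * Simplified tuple version: t_self is where this version is stored;
 * t_ctid is the header link to the next version in the update chain
 * (it equals t_self while this is the latest version).
 */
typedef struct {
    VersionTid t_self;
    VersionTid t_ctid;
} VersionedTuple;

/* Simulate an UPDATE: the old version's t_ctid now points at the new one. */
void demo_update(VersionedTuple *old_version, VersionedTuple *new_version,
                 VersionTid new_location)
{
    new_version->t_self = new_location;
    new_version->t_ctid = new_location;
    old_version->t_ctid = new_location;
}

/* TID to report for "on tuple (%u,%u)": the tuple's own location. */
VersionTid demo_tid_for_message(const VersionedTuple *tup)
{
    return tup->t_self;
}
```

After an update from (0,1) to (0,2), the old version still reports (0,1) through demo_tid_for_message(), whereas its t_ctid reads (0,2), the next tuple in the chain, which is the wrong thing to print in the error context.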

- Heikki

Attachments:

0001-Fix-reference-after-free-when-waiting-for-another-xa.patch (text/x-diff, +14 -13)

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: Segfault on exclusion constraint violation

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

9.4 and master segfaults, if an insertion would need to wait for another
transaction to finish because of an exclusion constraint. To reproduce:
...
This only happens with assertions enabled. The culprit is commit
f88d4cfc9d417dac2ee41a8f5e593898e56fd2bd, which added the 'ctid'
argument to XactLockTableWait. check_exclusion_constraint calls
index_endscan() just before XactLockTableWait, but that free's the
memory the ctid points to.

I'll write up a patch to change those call sites to use local variables.
Hopefully it's trivial enough to still include in 9.4.1, although time
is really running out..

If the only known bad consequence requires assertions enabled, I think
it would be more prudent to *not* try to fix this in haste.

regards, tom lane


#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#3)
Re: Segfault on exclusion constraint violation

On 02/02/2015 04:38 PM, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

9.4 and master segfaults, if an insertion would need to wait for another
transaction to finish because of an exclusion constraint. To reproduce:
...
This only happens with assertions enabled. The culprit is commit
f88d4cfc9d417dac2ee41a8f5e593898e56fd2bd, which added the 'ctid'
argument to XactLockTableWait. check_exclusion_constraint calls
index_endscan() just before XactLockTableWait, but that free's the
memory the ctid points to.

I'll write up a patch to change those call sites to use local variables.
Hopefully it's trivial enough to still include in 9.4.1, although time
is really running out..

If the only known bad consequence requires assertions enabled, I think
it would be more prudent to *not* try to fix this in haste.

Ok, I'll wait until after the release.

- Heikki


#5 Dennis Pozzi
dpozzi@adobe.com
In reply to: Heikki Linnakangas (#4)
Re: Segfault on exclusion constraint violation

We are seeing a similar segfault scenario in 9.4.1 without assertions enabled. We upgraded 4 of our production instances from 9.3.5 to 9.4.1 over the weekend.
We had staged on 9.4.0, and then 9.4.1 for over a month without seeing this error, but the staging work load is much lower than production.

postgres=# show debug_assertions;
debug_assertions
------------------
off
(1 row)

postgres=# set debug_assertions = 'on' ;
ERROR: assertion checking is not supported by this build
postgres=#

We believe the error occurred when an insert query was running inside a transaction, and another query attempted to insert into the same table.

Postgres log says:
2015-03-16 02:43:25.762 PDT,,,5481,,5504121c.1569,3,,2015-03-14 03:49:00 PDT,,0,LOG,00000,"server process (PID 9471) was terminated by signal 11: Segmentation fault","Failed process was running: INSERT into ad_instances (adinstid, cid, kid, adrefid, termid, sync_match_type,
status_code, start_date, sync_keyword)
SELECT nextval('public.ad_instances_adinstid_seq'), cid, kw.kid, adg.adrefid,
termid, ai_mtid, 'u', now(), sync_keyword
FROM tmp_keywords_1100218810
JOIN keywords kw using (keyword)
JOIN adgroups adg using (adref,cid)
WHERE keyword_type = 'n'
AND cid = 1100218810
GROUP BY cid, kw.kid, adg.adrefid, termid, ai_mtid, sync_keyword, tmp_keywords_1100218810.status_code",,,,,,,,""

Application log says:
2015-03-16 02:43:25.763 PDT,"release","c10036",4349,"op02.lon5.efrontier.com:39404",5506a408.10fd,27,"idle in transaction",2015-03-16 02:36:08 PDT,44/292306,3267900389,WARNING,57P02,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,"ad_status"

/var/log/messages says:
Mar 16 02:40:35 user27 kernel: : postgres[9471]: segfault at 30 ip 000000000066148b sp 00007fffa9c5f5f0 error 4 in postgres[400000+54d000]

We have not been able to reproduce the error, but we are testing scenarios in our staging environment now.


#6 Michael Paquier
michael@paquier.xyz
In reply to: Dennis Pozzi (#5)
Re: Segfault on exclusion constraint violation

On Wed, Mar 18, 2015 at 12:15 AM, Dennis Pozzi <dpozzi@adobe.com> wrote:

We have not been able to reproduce the error, but we are testing scenarios in our staging environment now.

If this is the same problem, this will be fixed in the next minor
release as the fix has been committed here after 9.4.1 was released:
commit: 57fe246890ad51e166fb6a8da937e41c35d7a279
author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
date: Wed, 4 Feb 2015 16:00:34 +0200
Fix reference-after-free when waiting for another xact due to constraint.

Now, if you are able to produce a self-contained test case that fails
even on HEAD, well that's another story...
Regards,
--
Michael
