pika buildfarm member failure on isolationCheck tests

Started by Rémi Zaraover 14 years ago9 messages
#1Rémi Zara
remi_zara@mac.com
1 attachment(s)

Hi,

Pika failed at the isolationCheck phase, hitting an assertion:

TRAP: FailedAssertion("!(!((oldSerXidControl->tailXid) != ((TransactionId) 0)) || TransactionIdFollows(xid, oldSerXidControl->tailXid))", File: "predicate.c", Line: 991)

see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pika&dt=2011-06-17%2015%3A45%3A30

It seems that for this stage, the server log (which contains the failed assertion) is not sent back to the buildfarm server. Maybe that should be fixed ?

Regards,

Rémi Zara

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#2Andrew Dunstan
andrew@dunslane.net
In reply to: Rémi Zara (#1)
Re: pika buildfarm member failure on isolationCheck tests

On 06/18/2011 11:57 AM, R�mi Zara wrote:

Hi,

Pika failed at the isolationCheck phase, hitting an assertion:

TRAP: FailedAssertion("!(!((oldSerXidControl->tailXid) != ((TransactionId) 0)) || TransactionIdFollows(xid, oldSerXidControl->tailXid))", File: "predicate.c", Line: 991)

see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pika&dt=2011-06-17%2015%3A45%3A30

It seems that for this stage, the server log (which contains the failed assertion) is not sent back to the buildfarm server. Maybe that should be fixed ?

Change committed. See
<https://github.com/PGBuildFarm/client-code/commit/dd49c04&gt;. It will be
in the next release.

cheers

andrew

#3Robert Haas
robertmhaas@gmail.com
In reply to: Rémi Zara (#1)
Re: pika buildfarm member failure on isolationCheck tests

On Sat, Jun 18, 2011 at 11:57 AM, Rémi Zara <remi_zara@mac.com> wrote:

Pika failed at the isolationCheck phase, hitting an assertion:

TRAP: FailedAssertion("!(!((oldSerXidControl->tailXid) != ((TransactionId) 0)) || TransactionIdFollows(xid, oldSerXidControl->tailXid))", File: "predicate.c", Line: 991)

see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=pika&amp;dt=2011-06-17%2015%3A45%3A30

It seems that for this stage, the server log (which contains the failed assertion) is not sent back to the buildfarm server. Maybe that should be fixed ?

Is this an open item for 9.1beta3?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Dan Ports
drkp@csail.mit.edu
In reply to: Robert Haas (#3)
Re: pika buildfarm member failure on isolationCheck tests

On Sun, Jun 19, 2011 at 09:10:02PM -0400, Robert Haas wrote:

Is this an open item for 9.1beta3?

Yes. I've put it on the list.

The SxactGlobalXmin and its refcount were getting out of sync with the
actual transaction state. This is timing-dependent but I can reproduce it
fairly reliably under concurrent workloads.

It looks the problem comes from the change a couple days ago that
removed the SXACT_FLAG_ROLLED_BACK flag and changed the
SxactIsRolledBack checks to SxactIsDoomed. That's the correct thing to
do everywhere else, but gets us in trouble here. We shouldn't be
ignoring doomed transactions in SetNewSxactGlobalXmin, because they're
not eligible for cleanup until the associated backend aborts the
transaction and calls ReleasePredicateLocks.

However, it isn't as simple as just removing the SxactIsDoomed check
from SetNewSxactGlobalXmin. ReleasePredicateLocks calls
SetNewSxactGlobalXmin *before* it releases the aborted transaction's
SerializableXact (it pretty much has to, because we must drop and
reacquire SerializableXactHashLock in between). But we don't want the
aborted transaction included in the SxactGlobalXmin computation.

It seems like we do need that SXACT_FLAG_ROLLED_BACK after all, even
though it's only set for this brief interval. We need to be able to
distinguish a transaction that's just been marked for death (doomed)
from one that's already called ReleasePredicateLocks.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/

#5Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Dan Ports (#4)
Re: pika buildfarm member failure on isolationCheck tests

Dan Ports <drkp@csail.mit.edu> wrote:

It looks the problem comes from the change a couple days ago that
removed the SXACT_FLAG_ROLLED_BACK flag and changed the
SxactIsRolledBack checks to SxactIsDoomed. That's the correct
thing to do everywhere else, but gets us in trouble here. We
shouldn't be ignoring doomed transactions in
SetNewSxactGlobalXmin, because they're not eligible for cleanup
until the associated backend aborts the transaction and calls
ReleasePredicateLocks.

However, it isn't as simple as just removing the SxactIsDoomed
check from SetNewSxactGlobalXmin. ReleasePredicateLocks calls
SetNewSxactGlobalXmin *before* it releases the aborted
transaction's SerializableXact (it pretty much has to, because we
must drop and reacquire SerializableXactHashLock in between). But
we don't want the aborted transaction included in the
SxactGlobalXmin computation.

It seems like we do need that SXACT_FLAG_ROLLED_BACK after all,
even though it's only set for this brief interval. We need to be
able to distinguish a transaction that's just been marked for
death (doomed) from one that's already called
ReleasePredicateLocks.

I just looked this over and I agree. It looks to me like we can set
this flag in ReleasePredicateLocks (along with the doomed flag) and
test it only in SetNewSxactGlobalXmin (instead of the doomed flag).
Fundamentally, the problem is that while it's enough to know that a
transaction *will be* canceled in order to ignore it during
detection of rw-conflicts and dangerous structures, it's not enough
when calculating the new serializable xmin -- we need to know that a
transaction actually *has been* canceled to ignore it there.

Essentially, the fix is to revert a very small part of
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=264a6b127a918800d9f8bac80b5f4a8a8799d0f1

Either of us could easily fix this, but since you have the test
environment which can reproduce the problem, it's probably best that
you do it.

Since Heikki took the trouble to put the flags in a nice logical
order, we should probably put this right after the doomed flag.

-Kevin

#6Dan Ports
drkp@csail.mit.edu
In reply to: Dan Ports (#4)
2 attachment(s)
Re: pika buildfarm member failure on isolationCheck tests

While testing the fix for this one, I found another bug. Patches for
both are attached.

The first patch addresses this bug by re-adding SXACT_FLAG_ROLLED_BACK,
in a more limited form than its previous incarnation.

We need to be able to distinguish transactions that have already
called ReleasePredicateLocks and are thus eligible for cleanup from
those that have been merely marked for abort by other
backends. Transactions that are ROLLED_BACK are excluded from
SxactGlobalXmin calculations, but those that are merely DOOMED need to
be included.

Also update a couple of assertions to ensure we only try to clean up
ROLLED_BACK transactions.

The second patch fixes a bug in PreCommit_CheckForSerializationFailure.
This function checks whether there's a dangerous structure of the form
far ---> near ---> me
where neither the "far" or "near" transactions have committed. If so,
it aborts the "near" transaction by marking it as DOOMED. However, that
transaction might already be PREPARED. We need to check whether that's
the case and, if so, abort the transaction that's trying to commit
instead.

One of the prepared_xacts regression tests actually hits this bug.
I removed the anomaly from the duplicate-gids test so that it fails in
the intended way, and added a new test to check serialization failures
with a prepared transaction.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/

Attachments:

check-conflict-prepared.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 35a25e0..6c55211 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -4542,16 +4542,6 @@ PreCommit_CheckForSerializationFailure(void)
 						&& !SxactIsReadOnly(farConflict->sxactOut)
 						&& !SxactIsDoomed(farConflict->sxactOut)))
 				{
-					if (SxactIsPrepared(nearConflict->sxactOut))
-					{
-						LWLockRelease(SerializableXactHashLock);
-						ereport(ERROR,
-								(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-								 errmsg("could not serialize access due to read/write dependencies among transactions"),
-								 errdetail("Cancelled on commit attempt with conflict in from prepared pivot."),
-								 errhint("The transaction might succeed if retried.")));
-						
-					}
 					nearConflict->sxactOut->flags |= SXACT_FLAG_DOOMED;
 					break;
 				}
diff --git a/src/test/regress/expected/prepared_xacts.out b/src/test/regress/expected/prepared_xacts.out
index 328cd74..1a6b4ce 100644
--- a/src/test/regress/expected/prepared_xacts.out
+++ b/src/test/regress/expected/prepared_xacts.out
@@ -88,17 +88,17 @@ SELECT gid FROM pg_prepared_xacts;
 
 BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
 INSERT INTO pxtest1 VALUES ('fff');
--- This should fail, because the gid foo3 is already in use
-PREPARE TRANSACTION 'foo3';
-ERROR:  transaction identifier "foo3" is already in use
 SELECT * FROM pxtest1;
  foobar 
 --------
  aaa
  ddd
-(2 rows)
+ fff
+(3 rows)
 
-ROLLBACK PREPARED 'foo3';
+-- This should fail, because the gid foo3 is already in use
+PREPARE TRANSACTION 'foo3';
+ERROR:  transaction identifier "foo3" is already in use
 SELECT * FROM pxtest1;
  foobar 
 --------
@@ -106,24 +106,7 @@ SELECT * FROM pxtest1;
  ddd
 (2 rows)
 
--- Test serialization failure (SSI)
-BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-UPDATE pxtest1 SET foobar = 'eee' WHERE foobar = 'ddd';
-SELECT * FROM pxtest1;
- foobar 
---------
- aaa
- eee
-(2 rows)
-
-PREPARE TRANSACTION 'foo4';
-SELECT gid FROM pg_prepared_xacts;
- gid  
-------
- foo4
-(1 row)
-
-BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
+ROLLBACK PREPARED 'foo3';
 SELECT * FROM pxtest1;
  foobar 
 --------
@@ -131,24 +114,6 @@ SELECT * FROM pxtest1;
  ddd
 (2 rows)
 
-INSERT INTO pxtest1 VALUES ('fff');
--- This should fail, because the two transactions have a write-skew anomaly
-PREPARE TRANSACTION 'foo5';
-ERROR:  could not serialize access due to read/write dependencies among transactions
-DETAIL:  Cancelled on commit attempt with conflict in from prepared pivot.
-HINT:  The transaction might succeed if retried.
-SELECT gid FROM pg_prepared_xacts;
- gid  
-------
- foo4
-(1 row)
-
-ROLLBACK PREPARED 'foo4';
-SELECT gid FROM pg_prepared_xacts;
- gid 
------
-(0 rows)
-
 -- Clean up
 DROP TABLE pxtest1;
 -- Test subtransactions
diff --git a/src/test/regress/sql/prepared_xacts.sql b/src/test/regress/sql/prepared_xacts.sql
index e06c9d4..2bdbb0d 100644
--- a/src/test/regress/sql/prepared_xacts.sql
+++ b/src/test/regress/sql/prepared_xacts.sql
@@ -54,6 +54,7 @@ SELECT gid FROM pg_prepared_xacts;
 
 BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
 INSERT INTO pxtest1 VALUES ('fff');
+SELECT * FROM pxtest1;
 
 -- This should fail, because the gid foo3 is already in use
 PREPARE TRANSACTION 'foo3';
@@ -64,27 +65,6 @@ ROLLBACK PREPARED 'foo3';
 
 SELECT * FROM pxtest1;
 
--- Test serialization failure (SSI)
-BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-UPDATE pxtest1 SET foobar = 'eee' WHERE foobar = 'ddd';
-SELECT * FROM pxtest1;
-PREPARE TRANSACTION 'foo4';
-
-SELECT gid FROM pg_prepared_xacts;
-
-BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-SELECT * FROM pxtest1;
-INSERT INTO pxtest1 VALUES ('fff');
-
--- This should fail, because the two transactions have a write-skew anomaly
-PREPARE TRANSACTION 'foo5';
-
-SELECT gid FROM pg_prepared_xacts;
-
-ROLLBACK PREPARED 'foo4';
-
-SELECT gid FROM pg_prepared_xacts;
-
 -- Clean up
 DROP TABLE pxtest1;
 
readd-rolled-back-flag.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 6c55211..3678878 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -246,7 +246,6 @@
 
 #define SxactIsCommitted(sxact) (((sxact)->flags & SXACT_FLAG_COMMITTED) != 0)
 #define SxactIsPrepared(sxact) (((sxact)->flags & SXACT_FLAG_PREPARED) != 0)
-#define SxactIsRolledBack(sxact) (((sxact)->flags & SXACT_FLAG_ROLLED_BACK) != 0)
 #define SxactIsDoomed(sxact) (((sxact)->flags & SXACT_FLAG_DOOMED) != 0)
 #define SxactIsReadOnly(sxact) (((sxact)->flags & SXACT_FLAG_READ_ONLY) != 0)
 #define SxactHasSummaryConflictIn(sxact) (((sxact)->flags & SXACT_FLAG_SUMMARY_CONFLICT_IN) != 0)
@@ -3047,7 +3046,7 @@ SetNewSxactGlobalXmin(void)
 
 	for (sxact = FirstPredXact(); sxact != NULL; sxact = NextPredXact(sxact))
 	{
-		if (!SxactIsRolledBack(sxact)
+		if (!SxactIsDoomed(sxact)
 			&& !SxactIsCommitted(sxact)
 			&& sxact != OldCommittedSxact)
 		{
@@ -3114,7 +3113,6 @@ ReleasePredicateLocks(const bool isCommit)
 	Assert(!isCommit || SxactIsPrepared(MySerializableXact));
 	Assert(!isCommit || !SxactIsDoomed(MySerializableXact));
 	Assert(!SxactIsCommitted(MySerializableXact));
-	Assert(!SxactIsRolledBack(MySerializableXact));
 
 	/* may not be serializable during COMMIT/ROLLBACK PREPARED */
 	if (MySerializableXact->pid != 0)
@@ -3153,22 +3151,7 @@ ReleasePredicateLocks(const bool isCommit)
 			MySerializableXact->flags |= SXACT_FLAG_READ_ONLY;
 	}
 	else
-	{
-		/*
-		 * The DOOMED flag indicates that we intend to roll back this
-		 * transaction and so it should not cause serialization
-		 * failures for other transactions that conflict with
-		 * it. Note that this flag might already be set, if another
-		 * backend marked this transaction for abort.
-		 *
-		 * The ROLLED_BACK flag further indicates that
-		 * ReleasePredicateLocks has been called, and so the
-		 * SerializableXact is eligible for cleanup. This means it
-		 * should not be considered when calculating SxactGlobalXmin.
-		 */
 		MySerializableXact->flags |= SXACT_FLAG_DOOMED;
-		MySerializableXact->flags |= SXACT_FLAG_ROLLED_BACK;
-	}
 
 	if (!topLevelIsDeclaredReadOnly)
 	{
@@ -3544,7 +3527,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
 				nextConflict;
 
 	Assert(sxact != NULL);
-	Assert(SxactIsRolledBack(sxact) || SxactIsCommitted(sxact));
+	Assert(SxactIsDoomed(sxact) || SxactIsCommitted(sxact));
 	Assert(LWLockHeldByMe(SerializableFinishedListLock));
 
 	/*
diff --git a/src/include/storage/predicate_internals.h b/src/include/storage/predicate_internals.h
index 34c661d..495983f 100644
--- a/src/include/storage/predicate_internals.h
+++ b/src/include/storage/predicate_internals.h
@@ -90,22 +90,21 @@ typedef struct SERIALIZABLEXACT
 	int			pid;			/* pid of associated process */
 } SERIALIZABLEXACT;
 
-#define SXACT_FLAG_COMMITTED			0x00000001	/* already committed */
-#define SXACT_FLAG_PREPARED				0x00000002	/* about to commit */
-#define SXACT_FLAG_ROLLED_BACK			0x00000004	/* already rolled back */
-#define SXACT_FLAG_DOOMED				0x00000008	/* will roll back */
+#define SXACT_FLAG_COMMITTED				0x00000001	/* already committed */
+#define SXACT_FLAG_PREPARED					0x00000002	/* about to commit */
+#define SXACT_FLAG_DOOMED					0x00000004	/* will roll back */
 /*
  * The following flag actually means that the flagged transaction has a
  * conflict out *to a transaction which committed ahead of it*.  It's hard
  * to get that into a name of a reasonable length.
  */
-#define SXACT_FLAG_CONFLICT_OUT			0x00000010
-#define SXACT_FLAG_READ_ONLY			0x00000020
-#define SXACT_FLAG_DEFERRABLE_WAITING	0x00000040
-#define SXACT_FLAG_RO_SAFE				0x00000080
-#define SXACT_FLAG_RO_UNSAFE			0x00000100
-#define SXACT_FLAG_SUMMARY_CONFLICT_IN	0x00000200
-#define SXACT_FLAG_SUMMARY_CONFLICT_OUT	0x00000400
+#define SXACT_FLAG_CONFLICT_OUT				0x00000008
+#define SXACT_FLAG_READ_ONLY				0x00000010
+#define SXACT_FLAG_DEFERRABLE_WAITING		0x00000020
+#define SXACT_FLAG_RO_SAFE					0x00000040
+#define SXACT_FLAG_RO_UNSAFE				0x00000080
+#define SXACT_FLAG_SUMMARY_CONFLICT_IN		0x00000100
+#define SXACT_FLAG_SUMMARY_CONFLICT_OUT		0x00000200
 
 /*
  * The following types are used to provide an ad hoc list for holding
#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Dan Ports (#6)
Re: pika buildfarm member failure on isolationCheck tests

On 21.06.2011 05:18, Dan Ports wrote:

The first patch addresses this bug by re-adding SXACT_FLAG_ROLLED_BACK,
in a more limited form than its previous incarnation.

We need to be able to distinguish transactions that have already
called ReleasePredicateLocks and are thus eligible for cleanup from
those that have been merely marked for abort by other
backends. Transactions that are ROLLED_BACK are excluded from
SxactGlobalXmin calculations, but those that are merely DOOMED need to
be included.

Also update a couple of assertions to ensure we only try to clean up
ROLLED_BACK transactions.

Thanks, committed.

The second patch fixes a bug in PreCommit_CheckForSerializationFailure.
This function checks whether there's a dangerous structure of the form
far ---> near ---> me
where neither the "far" or "near" transactions have committed. If so,
it aborts the "near" transaction by marking it as DOOMED. However, that
transaction might already be PREPARED. We need to check whether that's
the case and, if so, abort the transaction that's trying to commit
instead.

Yep, committed. All the other places where we set the DOOMED flag seem
to handle that already.

In the long term, I'm not sure this is the best way to handle this. It
feels a bit silly to set the flag, release the SerializableXactHashLock,
and reacquire it later to remove the transaction from the hash table.
Surely that could be done in some more straightforward way. But I don't
feel like fiddling with it this late in the release cycle.

One of the prepared_xacts regression tests actually hits this bug.
I removed the anomaly from the duplicate-gids test so that it fails in
the intended way, and added a new test to check serialization failures
with a prepared transaction.

Hmm, I have ran "make installcheck" with
default_transaction_isolation='serializable' earlier. I wonder why I
didn't see that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8Dan Ports
drkp@csail.mit.edu
In reply to: Heikki Linnakangas (#7)
Re: pika buildfarm member failure on isolationCheck tests

On Tue, Jun 21, 2011 at 03:01:48PM +0300, Heikki Linnakangas wrote:

Thanks, committed.

Thanks.

In the long term, I'm not sure this is the best way to handle this. It
feels a bit silly to set the flag, release the SerializableXactHashLock,
and reacquire it later to remove the transaction from the hash table.
Surely that could be done in some more straightforward way. But I don't
feel like fiddling with it this late in the release cycle.

Yes, I suspect it can be done better. The reason it's tricky is a lock
ordering issue; part of releasing a SerializableXact has to be done
while holding SerializableXactHashLock and part has to be done without
it (because it involves taking partition locks). Reworking the order of
these things might work, but would require some careful thought since
most of the code is shared with the non-abort cleanup paths. And yes,
it's definitely the time for that.

I've been meaning to take another look at this part of the code anyway,
with an eye towards performance. SerializableXactHashLock can be a
bottleneck on certain in-memory workloads.

One of the prepared_xacts regression tests actually hits this bug.
I removed the anomaly from the duplicate-gids test so that it fails in
the intended way, and added a new test to check serialization failures
with a prepared transaction.

Hmm, I have ran "make installcheck" with
default_transaction_isolation='serializable' earlier. I wonder why I
didn't see that.

The affected test was being run at SERIALIZABLE already, so that
wouldn't have made a difference. One reason I didn't notice an issue
when I looked at that test before before, is that it was intended to
fail anyway, just with a different error. I didn't think too hard about
which error would take precedence.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/

#9Dan Ports
drkp@csail.mit.edu
In reply to: Dan Ports (#8)
Re: pika buildfarm member failure on isolationCheck tests

On Wed, Jun 22, 2011 at 01:31:11AM -0400, Dan Ports wrote:

Yes, I suspect it can be done better. The reason it's tricky is a lock
ordering issue; part of releasing a SerializableXact has to be done
while holding SerializableXactHashLock and part has to be done without
it (because it involves taking partition locks). Reworking the order of
these things might work, but would require some careful thought since
most of the code is shared with the non-abort cleanup paths. And yes,
it's definitely the time for that.

...by which I mean it's definitely *not* the time for that, of course.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/