Problem with txid_snapshot_in/out() functionality

Started by Jan Wieckalmost 12 years ago31 messages
#1Jan Wieck
jan@wi3ck.info

Hackers,

the Slony team has been getting seldom reports of a problem with the
txid_snapshot data type.

The symptom is that a txid_snapshot on output lists the same txid
multiple times in the xip list part of the external representation. This
string is later rejected by txid_snapshot_in() when trying to determine
if a particular txid is visible in that snapshot using the
txid_visible_in_snapshot() function.

I was not yet able to reproduce this problem in a lab environment. It
might be related to subtransactions and/or two phase commit (at least
one user is using both of them). The reported PostgreSQL version
involved in that case was 9.1.

At this point I would find it extremely helpful to "sanitize" the
external representation in txid_snapshot_out() while emitting some
NOTICE level logging when this actually happens. I am aware that this
does amount to a functional change for a back release, but considering
that the _out() generated external representation of an existing binary
datum won't pass the type's _in() function, I argue that such change is
warranted. Especially since this problem could possibly corrupt a dump.

Comments?

Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Jan Wieck (#1)
Re: Problem with txid_snapshot_in/out() functionality

On 04/12/2014 12:07 AM, Jan Wieck wrote:

Hackers,

the Slony team has been getting seldom reports of a problem with the
txid_snapshot data type.

The symptom is that a txid_snapshot on output lists the same txid
multiple times in the xip list part of the external representation. This
string is later rejected by txid_snapshot_in() when trying to determine
if a particular txid is visible in that snapshot using the
txid_visible_in_snapshot() function.

I was not yet able to reproduce this problem in a lab environment. It
might be related to subtransactions and/or two phase commit (at least
one user is using both of them). The reported PostgreSQL version
involved in that case was 9.1.

It's two-phase commit. When preparing a transaction, the state of the
transaction is first transfered to a dummy PGXACT entry, and then the
PGXACT entry of the backend is cleared. There is a transient state when
both PGXACT entries have the same xid.

You can reproduce that by putting a sleep or breakpoint in
PrepareTransaction(), just before the
"ProcArrayClearTransaction(MyProc);" call. If you call
txid_current_snapshot() from another session at that point, it will
output two duplicate xids. (you will have to also commit one more
unrelated transaction to bump up xmax).

At this point I would find it extremely helpful to "sanitize" the
external representation in txid_snapshot_out() while emitting some
NOTICE level logging when this actually happens. I am aware that this
does amount to a functional change for a back release, but considering
that the _out() generated external representation of an existing binary
datum won't pass the type's _in() function, I argue that such change is
warranted. Especially since this problem could possibly corrupt a dump.

Hmm. Do we snapshots to be stored in tables, and included in a dump? I
don't think we can guarantee that will work, at least not across
versions, as the way we handle snapshot internally can change.

But yeah, we probably should do something about that. The most
straightforward fix would be to scan the array in
txid_current_snapshot() and remove any duplicates.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Jan Wieck
jan@wi3ck.info
In reply to: Heikki Linnakangas (#2)
Re: Problem with txid_snapshot_in/out() functionality

On 04/12/14 03:27, Heikki Linnakangas wrote:

On 04/12/2014 12:07 AM, Jan Wieck wrote:

Hackers,

the Slony team has been getting seldom reports of a problem with the
txid_snapshot data type.

The symptom is that a txid_snapshot on output lists the same txid
multiple times in the xip list part of the external representation. This
string is later rejected by txid_snapshot_in() when trying to determine
if a particular txid is visible in that snapshot using the
txid_visible_in_snapshot() function.

I was not yet able to reproduce this problem in a lab environment. It
might be related to subtransactions and/or two phase commit (at least
one user is using both of them). The reported PostgreSQL version
involved in that case was 9.1.

It's two-phase commit. When preparing a transaction, the state of the
transaction is first transfered to a dummy PGXACT entry, and then the
PGXACT entry of the backend is cleared. There is a transient state when
both PGXACT entries have the same xid.

You can reproduce that by putting a sleep or breakpoint in
PrepareTransaction(), just before the
"ProcArrayClearTransaction(MyProc);" call. If you call
txid_current_snapshot() from another session at that point, it will
output two duplicate xids. (you will have to also commit one more
unrelated transaction to bump up xmax).

Thanks, that explains it.

At this point I would find it extremely helpful to "sanitize" the
external representation in txid_snapshot_out() while emitting some
NOTICE level logging when this actually happens. I am aware that this
does amount to a functional change for a back release, but considering
that the _out() generated external representation of an existing binary
datum won't pass the type's _in() function, I argue that such change is
warranted. Especially since this problem could possibly corrupt a dump.

Hmm. Do we snapshots to be stored in tables, and included in a dump? I
don't think we can guarantee that will work, at least not across
versions, as the way we handle snapshot internally can change.

At least Londiste and Slony do store snapshots as well as xids in tables
and assuming that the txid epoch is properly bumped, that information is
useful and valid after a restore.

But yeah, we probably should do something about that. The most
straightforward fix would be to scan the array in
txid_current_snapshot() and remove any duplicates.

The code in txid_snapshot_in() checks that the xip list is ascending.
txid_snapshot_out() does not sort the list, so it must already be sorted
when the snapshot itself is created. That scan would be fairly simple.

Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#2)
Re: Problem with txid_snapshot_in/out() functionality

On 2014-04-12 10:27:16 +0300, Heikki Linnakangas wrote:

On 04/12/2014 12:07 AM, Jan Wieck wrote:

The symptom is that a txid_snapshot on output lists the same txid
multiple times in the xip list part of the external representation. This
string is later rejected by txid_snapshot_in() when trying to determine
if a particular txid is visible in that snapshot using the
txid_visible_in_snapshot() function.

It's two-phase commit. When preparing a transaction, the state of the
transaction is first transfered to a dummy PGXACT entry, and then the PGXACT
entry of the backend is cleared. There is a transient state when both PGXACT
entries have the same xid.

Which I find to be a pretty bad idea independent of this bug. But I
think that's nothing fixable in the back branches.

Hmm. Do we snapshots to be stored in tables, and included in a dump? I don't
think we can guarantee that will work, at least not across versions, as the
way we handle snapshot internally can change.

Hm. I don't think we'll earn much love changing that - there's at the
very least slony and londiste out there using it... IIRC both store the
result in tables.

But yeah, we probably should do something about that. The most
straightforward fix would be to scan the array in txid_current_snapshot()
and remove any duplicates.

Since it's sorted there, that should be fairly straightforward. Won't
fix already created and stored datums tho. Maybe _in()/parse_snapshot()
should additionally skip over duplicate values? Looks easy enough.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Jan Wieck
jan@wi3ck.info
In reply to: Andres Freund (#4)
Re: Problem with txid_snapshot_in/out() functionality

On 04/12/14 08:38, Andres Freund wrote:

Since it's sorted there, that should be fairly straightforward. Won't
fix already created and stored datums tho. Maybe _in()/parse_snapshot()
should additionally skip over duplicate values? Looks easy enough.

There is the sort ... missed that when glancing over the code earlier.

Right, that is easy enough and looks like an acceptable fix for back
branches too.

Thanks,
Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#2)
Re: Problem with txid_snapshot_in/out() functionality

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 04/12/2014 12:07 AM, Jan Wieck wrote:

the Slony team has been getting seldom reports of a problem with the
txid_snapshot data type.
The symptom is that a txid_snapshot on output lists the same txid
multiple times in the xip list part of the external representation.

It's two-phase commit. When preparing a transaction, the state of the
transaction is first transfered to a dummy PGXACT entry, and then the
PGXACT entry of the backend is cleared. There is a transient state when
both PGXACT entries have the same xid.

Hm, yeah, but why is that intermediate state visible to anyone else?
Don't we have exclusive lock on the PGPROC array while we're doing this?
If we don't, aren't we letting other backends see non-self-consistent
state in regards to who holds which locks, for example?

I'm worried that the proposed fix is just band-aiding one particular
symptom of inadequate locking.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#6)
Re: Problem with txid_snapshot_in/out() functionality

On 2014-04-12 09:47:24 -0400, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 04/12/2014 12:07 AM, Jan Wieck wrote:

the Slony team has been getting seldom reports of a problem with the
txid_snapshot data type.
The symptom is that a txid_snapshot on output lists the same txid
multiple times in the xip list part of the external representation.

It's two-phase commit. When preparing a transaction, the state of the
transaction is first transfered to a dummy PGXACT entry, and then the
PGXACT entry of the backend is cleared. There is a transient state when
both PGXACT entries have the same xid.

Hm, yeah, but why is that intermediate state visible to anyone else?
Don't we have exclusive lock on the PGPROC array while we're doing this?

It's done outside the remit of ProcArray lock :(. And documented to lead
to duplicate xids in PGXACT.
EndPrepare():
/*
* Mark the prepared transaction as valid. As soon as xact.c marks
* MyPgXact as not running our XID (which it will do immediately after
* this function returns), others can commit/rollback the xact.
*
* NB: a side effect of this is to make a dummy ProcArray entry for the
* prepared XID. This must happen before we clear the XID from MyPgXact,
* else there is a window where the XID is not running according to
* TransactionIdIsInProgress, and onlookers would be entitled to assume
* the xact crashed. Instead we have a window where the same XID appears
* twice in ProcArray, which is OK.
*/
MarkAsPrepared(gxact);

It doesn't sound too hard to essentially move PrepareTransaction()'s
ProcArrayClearTransaction() into MarkAsPrepared() and rejigger the
locking to remove the intermediate state. But I think it'll lead to
bigger changes than we'd be comfortable backpatching.

If we don't, aren't we letting other backends see non-self-consistent
state in regards to who holds which locks, for example?

I think that actually works out ok, because the locks aren't owned by
xids/xacts, but procs. Otherwise we'd be in deep trouble in
CommitTransaction() as well where ProcArrayEndTransaction() clearing
that state.
After the whole xid transfer, there's PostPrepare_Locks() transferring
the locks.

Brr.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Greg Stark
stark@mit.edu
In reply to: Jan Wieck (#3)
Re: Problem with txid_snapshot_in/out() functionality

On 12 Apr 2014 08:35, "Jan Wieck" <jan@wi3ck.info> wrote:

On 04/12/14 03:27, Heikki Linnakangas wrote:

On 04/12/2014 12:07 AM, Jan Wieck wrote:

Hackers,

Hmm. Do we snapshots to be stored in tables, and included in a dump? I
don't think we can guarantee that will work, at least not across
versions, as the way we handle snapshot internally can change.

At least Londiste and Slony do store snapshots as well as xids in tables

and assuming that the txid epoch is properly bumped, that information is
useful and valid after a restore.

As I understand it the epoch increments whenever the xid wraps.

A physical restore would continue the same xid space in the same epoch
which should work fine as long as no system stores any txids outside the
database from the "future".

A pg_restore would start a new xid space from FirstNormalXid which would
obviously not work with any stored txids.

#9Jan Wieck
jan@wi3ck.info
In reply to: Greg Stark (#8)
Re: Problem with txid_snapshot_in/out() functionality

On 04/12/14 10:09, Greg Stark wrote:

A pg_restore would start a new xid space from FirstNormalXid which would
obviously not work with any stored txids.

http://www.postgresql.org/docs/9.3/static/app-pgresetxlog.html

Regards,
Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Andres Freund
andres@2ndquadrant.com
In reply to: Jan Wieck (#9)
Re: Problem with txid_snapshot_in/out() functionality

On 2014-04-12 11:15:09 -0400, Jan Wieck wrote:

On 04/12/14 10:09, Greg Stark wrote:

A pg_restore would start a new xid space from FirstNormalXid which would
obviously not work with any stored txids.

http://www.postgresql.org/docs/9.3/static/app-pgresetxlog.html

Using that as part of any sort of routine task IMNSHO is a seriously bad
idea.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Jan Wieck
jan@wi3ck.info
In reply to: Andres Freund (#10)
Re: Problem with txid_snapshot_in/out() functionality

On 04/12/14 11:18, Andres Freund wrote:

On 2014-04-12 11:15:09 -0400, Jan Wieck wrote:

On 04/12/14 10:09, Greg Stark wrote:

A pg_restore would start a new xid space from FirstNormalXid which would
obviously not work with any stored txids.

http://www.postgresql.org/docs/9.3/static/app-pgresetxlog.html

Using that as part of any sort of routine task IMNSHO is a seriously bad
idea.

Nobody is advocating doing so.

Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Jan Wieck
jan@wi3ck.info
In reply to: Andres Freund (#7)
1 attachment(s)
Re: Problem with txid_snapshot_in/out() functionality

On 04/12/14 10:03, Andres Freund wrote:

On 2014-04-12 09:47:24 -0400, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 04/12/2014 12:07 AM, Jan Wieck wrote:

the Slony team has been getting seldom reports of a problem with the
txid_snapshot data type.
The symptom is that a txid_snapshot on output lists the same txid
multiple times in the xip list part of the external representation.

It's two-phase commit. When preparing a transaction, the state of the
transaction is first transfered to a dummy PGXACT entry, and then the
PGXACT entry of the backend is cleared. There is a transient state when
both PGXACT entries have the same xid.

Hm, yeah, but why is that intermediate state visible to anyone else?
Don't we have exclusive lock on the PGPROC array while we're doing this?

It's done outside the remit of ProcArray lock :(. And documented to lead
to duplicate xids in PGXACT.
EndPrepare():
/*
* Mark the prepared transaction as valid. As soon as xact.c marks
* MyPgXact as not running our XID (which it will do immediately after
* this function returns), others can commit/rollback the xact.
*
* NB: a side effect of this is to make a dummy ProcArray entry for the
* prepared XID. This must happen before we clear the XID from MyPgXact,
* else there is a window where the XID is not running according to
* TransactionIdIsInProgress, and onlookers would be entitled to assume
* the xact crashed. Instead we have a window where the same XID appears
* twice in ProcArray, which is OK.
*/
MarkAsPrepared(gxact);

It doesn't sound too hard to essentially move PrepareTransaction()'s
ProcArrayClearTransaction() into MarkAsPrepared() and rejigger the
locking to remove the intermediate state. But I think it'll lead to
bigger changes than we'd be comfortable backpatching.

If we don't, aren't we letting other backends see non-self-consistent
state in regards to who holds which locks, for example?

I think that actually works out ok, because the locks aren't owned by
xids/xacts, but procs. Otherwise we'd be in deep trouble in
CommitTransaction() as well where ProcArrayEndTransaction() clearing
that state.
After the whole xid transfer, there's PostPrepare_Locks() transferring
the locks.

Since it doesn't seem to produce any side effects, I'd think that making
the snapshot unique within txid_current_snapshot() and filtering
duplicates on input should be sufficient and eligible for backpatching.

The attached patch adds a unique loop to the internal sort_snapshot()
function and skips duplicates on input. The git commit is here:

https://github.com/wieck/postgres/commit/a88a2b2c25b856478d7e2b012fc718106338fe00

Regards,
Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info

Attachments:

txid_snapshot_filter_duplicate_xip.difftext/plain; charset=us-ascii; name=txid_snapshot_filter_duplicate_xip.diffDownload
diff --git a/src/backend/utils/adt/txid.c b/src/backend/utils/adt/txid.c
index a005e67..44f7880 100644
*** a/src/backend/utils/adt/txid.c
--- b/src/backend/utils/adt/txid.c
*************** cmp_txid(const void *aa, const void *bb)
*** 131,146 ****
  }
  
  /*
!  * sort a snapshot's txids, so we can use bsearch() later.
   *
   * For consistency of on-disk representation, we always sort even if bsearch
!  * will not be used.
   */
  static void
  sort_snapshot(TxidSnapshot *snap)
  {
  	if (snap->nxip > 1)
  		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
  }
  
  /*
--- 131,161 ----
  }
  
  /*
!  * unique sort a snapshot's txids, so we can use bsearch() later.
   *
   * For consistency of on-disk representation, we always sort even if bsearch
!  * will not be used. 
   */
  static void
  sort_snapshot(TxidSnapshot *snap)
  {
+ 	txid	last = 0;
+ 	int		nxip, idx1, idx2;
+ 
  	if (snap->nxip > 1)
+ 	{
  		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
+ 		nxip = snap->nxip;
+ 		idx1 = idx2 = 0;
+ 		while (idx1 < nxip)
+ 		{
+ 			if (snap->xip[idx1] != last)
+ 				last = snap->xip[idx2++] = snap->xip[idx1];
+ 			else
+ 				snap->nxip--;
+ 			idx1++;
+ 		}
+ 	}
  }
  
  /*
*************** parse_snapshot(const char *str)
*** 294,304 ****
  		val = str2txid(str, &endp);
  		str = endp;
  
! 		/* require the input to be in order */
! 		if (val < xmin || val >= xmax || val <= last_val)
  			goto bad_format;
  
! 		buf_add_txid(buf, val);
  		last_val = val;
  
  		if (*str == ',')
--- 309,320 ----
  		val = str2txid(str, &endp);
  		str = endp;
  
! 		/* require the input to be in order and skip duplicates */
! 		if (val < xmin || val >= xmax || val < last_val)
  			goto bad_format;
  
! 		if (val != last_val)
! 			buf_add_txid(buf, val);
  		last_val = val;
  
  		if (*str == ',')
diff --git a/src/test/regress/expected/txid.out b/src/test/regress/expected/txid.out
index 930b86a..7750b7b 100644
*** a/src/test/regress/expected/txid.out
--- b/src/test/regress/expected/txid.out
*************** select '12:18:14,16'::txid_snapshot;
*** 12,17 ****
--- 12,23 ----
   12:18:14,16
  (1 row)
  
+ select '12:16:14,14'::txid_snapshot;
+  txid_snapshot 
+ ---------------
+  12:16:14
+ (1 row)
+ 
  -- errors
  select '31:12:'::txid_snapshot;
  ERROR:  invalid input for txid_snapshot: "31:12:"
*************** select '12:16:14,13'::txid_snapshot;
*** 29,38 ****
  ERROR:  invalid input for txid_snapshot: "12:16:14,13"
  LINE 1: select '12:16:14,13'::txid_snapshot;
                 ^
- select '12:16:14,14'::txid_snapshot;
- ERROR:  invalid input for txid_snapshot: "12:16:14,14"
- LINE 1: select '12:16:14,14'::txid_snapshot;
-                ^
  create temp table snapshot_test (
  	nr	integer,
  	snap	txid_snapshot
--- 35,40 ----
diff --git a/src/test/regress/sql/txid.sql b/src/test/regress/sql/txid.sql
index ecae10e..b6650b9 100644
*** a/src/test/regress/sql/txid.sql
--- b/src/test/regress/sql/txid.sql
***************
*** 3,15 ****
  -- i/o
  select '12:13:'::txid_snapshot;
  select '12:18:14,16'::txid_snapshot;
  
  -- errors
  select '31:12:'::txid_snapshot;
  select '0:1:'::txid_snapshot;
  select '12:13:0'::txid_snapshot;
  select '12:16:14,13'::txid_snapshot;
- select '12:16:14,14'::txid_snapshot;
  
  create temp table snapshot_test (
  	nr	integer,
--- 3,15 ----
  -- i/o
  select '12:13:'::txid_snapshot;
  select '12:18:14,16'::txid_snapshot;
+ select '12:16:14,14'::txid_snapshot;
  
  -- errors
  select '31:12:'::txid_snapshot;
  select '0:1:'::txid_snapshot;
  select '12:13:0'::txid_snapshot;
  select '12:16:14,13'::txid_snapshot;
  
  create temp table snapshot_test (
  	nr	integer,
#13Marko Kreen
markokr@gmail.com
In reply to: Jan Wieck (#12)
Re: Problem with txid_snapshot_in/out() functionality

On Sat, Apr 12, 2014 at 02:10:13PM -0400, Jan Wieck wrote:

Since it doesn't seem to produce any side effects, I'd think that
making the snapshot unique within txid_current_snapshot() and
filtering duplicates on input should be sufficient and eligible for
backpatching.

Agreed.

The attached patch adds a unique loop to the internal
sort_snapshot() function and skips duplicates on input. The git
commit is here:

https://github.com/wieck/postgres/commit/a88a2b2c25b856478d7e2b012fc718106338fe00

static void
sort_snapshot(TxidSnapshot *snap)
{
+ 	txid	last = 0;
+ 	int		nxip, idx1, idx2;
+ 
if (snap->nxip > 1)
+ 	{
qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
+ 		nxip = snap->nxip;
+ 		idx1 = idx2 = 0;
+ 		while (idx1 < nxip)
+ 		{
+ 			if (snap->xip[idx1] != last)
+ 				last = snap->xip[idx2++] = snap->xip[idx1];
+ 			else
+ 				snap->nxip--;
+ 			idx1++;
+ 		}
+ 	}
}

I think you need to do SET_VARSIZE also here. Alternative is to
move SET_VARSIZE after sort_snapshot().

And it seems the drop-double-txid logic should be added also to
txid_snapshot_recv(). It seems weird to have it behave differently
from txid_snapshot_in().

--
marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Jan Wieck
jan@wi3ck.info
In reply to: Marko Kreen (#13)
Re: Problem with txid_snapshot_in/out() functionality

On 04/13/14 08:27, Marko Kreen wrote:

On Sat, Apr 12, 2014 at 02:10:13PM -0400, Jan Wieck wrote:

Since it doesn't seem to produce any side effects, I'd think that
making the snapshot unique within txid_current_snapshot() and
filtering duplicates on input should be sufficient and eligible for
backpatching.

Agreed.

The attached patch adds a unique loop to the internal
sort_snapshot() function and skips duplicates on input. The git
commit is here:

https://github.com/wieck/postgres/commit/a88a2b2c25b856478d7e2b012fc718106338fe00

static void
sort_snapshot(TxidSnapshot *snap)
{
+ 	txid	last = 0;
+ 	int		nxip, idx1, idx2;
+
if (snap->nxip > 1)
+ 	{
qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
+ 		nxip = snap->nxip;
+ 		idx1 = idx2 = 0;
+ 		while (idx1 < nxip)
+ 		{
+ 			if (snap->xip[idx1] != last)
+ 				last = snap->xip[idx2++] = snap->xip[idx1];
+ 			else
+ 				snap->nxip--;
+ 			idx1++;
+ 		}
+ 	}
}

I think you need to do SET_VARSIZE also here. Alternative is to
move SET_VARSIZE after sort_snapshot().

And it seems the drop-double-txid logic should be added also to
txid_snapshot_recv(). It seems weird to have it behave differently
from txid_snapshot_in().

Thanks,

yes on both issues. Will create another patch.

Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#7)
Race condition between PREPARE TRANSACTION and COMMIT PREPARED (was Re: Problem with txid_snapshot_in/out() functionality)

On 04/12/2014 05:03 PM, Andres Freund wrote:

If we don't, aren't we letting other backends see non-self-consistent
state in regards to who holds which locks, for example?

I think that actually works out ok, because the locks aren't owned by
xids/xacts, but procs. Otherwise we'd be in deep trouble in
CommitTransaction() as well where ProcArrayEndTransaction() clearing
that state.
After the whole xid transfer, there's PostPrepare_Locks() transferring
the locks.

Right.

However, I just noticed that there's a race condition between PREPARE
TRANSACTION and COMMIT/ROLLBACK PREPARED. PostPrepare_Locks runs after
the prepared transaction is already marked as fully prepared. That means
that by the time we get to PostPrepare_Locks, another backend might
already have finished and removed the prepared transaction. That leads
to a PANIC (put a breakpoint just before PostPrepare_Locks):

postgres=# commit prepared 'foo';
PANIC: failed to re-find shared proclock object
PANIC: failed to re-find shared proclock object
The connection to the server was lost. Attempting reset: Failed.

FinishPrepareTransaction reads the list of locks from the two-phase
state file, but PANICs when it doesn't find the corresponding locks in
the lock manager (because PostPrepare_Locks hasn't transfered them to
the dummy PGPROC yet).

I think we'll need to transfer of the locks earlier, before the
transaction is marked as fully prepared. I'll take a closer look at this
tomorrow.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Jan Wieck
jan@wi3ck.info
In reply to: Jan Wieck (#14)
1 attachment(s)
Re: Problem with txid_snapshot_in/out() functionality

On 04/13/14 14:22, Jan Wieck wrote:

On 04/13/14 08:27, Marko Kreen wrote:

I think you need to do SET_VARSIZE also here. Alternative is to
move SET_VARSIZE after sort_snapshot().

And it seems the drop-double-txid logic should be added also to
txid_snapshot_recv(). It seems weird to have it behave differently
from txid_snapshot_in().

Thanks,

yes on both issues. Will create another patch.

New patch attached.

New github commit is
https://github.com/wieck/postgres/commit/b8fd0d2eb78791e5171e34aecd233fd06218890d

Thanks again,
Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info

Attachments:

txid_snapshot_filter_duplicate_xip_v2.difftext/plain; charset=us-ascii; name=txid_snapshot_filter_duplicate_xip_v2.diffDownload
diff --git a/src/backend/utils/adt/txid.c b/src/backend/utils/adt/txid.c
index a005e67..1023566 100644
*** a/src/backend/utils/adt/txid.c
--- b/src/backend/utils/adt/txid.c
*************** cmp_txid(const void *aa, const void *bb)
*** 131,146 ****
  }
  
  /*
!  * sort a snapshot's txids, so we can use bsearch() later.
   *
   * For consistency of on-disk representation, we always sort even if bsearch
!  * will not be used.
   */
  static void
  sort_snapshot(TxidSnapshot *snap)
  {
  	if (snap->nxip > 1)
  		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
  }
  
  /*
--- 131,161 ----
  }
  
  /*
!  * unique sort a snapshot's txids, so we can use bsearch() later.
   *
   * For consistency of on-disk representation, we always sort even if bsearch
!  * will not be used. 
   */
  static void
  sort_snapshot(TxidSnapshot *snap)
  {
+ 	txid	last = 0;
+ 	int		nxip, idx1, idx2;
+ 
  	if (snap->nxip > 1)
+ 	{
  		qsort(snap->xip, snap->nxip, sizeof(txid), cmp_txid);
+ 		nxip = snap->nxip;
+ 		idx1 = idx2 = 0;
+ 		while (idx1 < nxip)
+ 		{
+ 			if (snap->xip[idx1] != last)
+ 				last = snap->xip[idx2++] = snap->xip[idx1];
+ 			else
+ 				snap->nxip--;
+ 			idx1++;
+ 		}
+ 	}
  }
  
  /*
*************** parse_snapshot(const char *str)
*** 294,304 ****
  		val = str2txid(str, &endp);
  		str = endp;
  
! 		/* require the input to be in order */
! 		if (val < xmin || val >= xmax || val <= last_val)
  			goto bad_format;
  
! 		buf_add_txid(buf, val);
  		last_val = val;
  
  		if (*str == ',')
--- 309,320 ----
  		val = str2txid(str, &endp);
  		str = endp;
  
! 		/* require the input to be in order and skip duplicates */
! 		if (val < xmin || val >= xmax || val < last_val)
  			goto bad_format;
  
! 		if (val != last_val)
! 			buf_add_txid(buf, val);
  		last_val = val;
  
  		if (*str == ',')
*************** txid_current_snapshot(PG_FUNCTION_ARGS)
*** 361,368 ****
  {
  	TxidSnapshot *snap;
  	uint32		nxip,
! 				i,
! 				size;
  	TxidEpoch	state;
  	Snapshot	cur;
  
--- 377,383 ----
  {
  	TxidSnapshot *snap;
  	uint32		nxip,
! 				i;
  	TxidEpoch	state;
  	Snapshot	cur;
  
*************** txid_current_snapshot(PG_FUNCTION_ARGS)
*** 381,389 ****
  
  	/* allocate */
  	nxip = cur->xcnt;
! 	size = TXID_SNAPSHOT_SIZE(nxip);
! 	snap = palloc(size);
! 	SET_VARSIZE(snap, size);
  
  	/* fill */
  	snap->xmin = convert_xid(cur->xmin, &state);
--- 396,402 ----
  
  	/* allocate */
  	nxip = cur->xcnt;
! 	snap = palloc(TXID_SNAPSHOT_SIZE(nxip));
  
  	/* fill */
  	snap->xmin = convert_xid(cur->xmin, &state);
*************** txid_current_snapshot(PG_FUNCTION_ARGS)
*** 395,400 ****
--- 408,416 ----
  	/* we want them guaranteed to be in ascending order */
  	sort_snapshot(snap);
  
+ 	/* we set the size here because the sort may have removed duplicate xips */
+ 	SET_VARSIZE(snap, TXID_SNAPSHOT_SIZE(snap->nxip));
+ 
  	PG_RETURN_POINTER(snap);
  }
  
*************** txid_snapshot_recv(PG_FUNCTION_ARGS)
*** 472,489 ****
  	snap = palloc(TXID_SNAPSHOT_SIZE(nxip));
  	snap->xmin = xmin;
  	snap->xmax = xmax;
- 	snap->nxip = nxip;
- 	SET_VARSIZE(snap, TXID_SNAPSHOT_SIZE(nxip));
  
  	for (i = 0; i < nxip; i++)
  	{
  		txid		cur = pq_getmsgint64(buf);
  
! 		if (cur <= last || cur < xmin || cur >= xmax)
  			goto bad_format;
  		snap->xip[i] = cur;
  		last = cur;
  	}
  	PG_RETURN_POINTER(snap);
  
  bad_format:
--- 488,514 ----
  	snap = palloc(TXID_SNAPSHOT_SIZE(nxip));
  	snap->xmin = xmin;
  	snap->xmax = xmax;
  
  	for (i = 0; i < nxip; i++)
  	{
  		txid		cur = pq_getmsgint64(buf);
  
! 		if (cur < last || cur < xmin || cur >= xmax)
  			goto bad_format;
+ 
+ 		/* skip duplicate xips */
+ 		if (cur == last)
+ 		{
+ 			i--;
+ 			nxip--;
+ 			continue;
+ 		}
+ 
  		snap->xip[i] = cur;
  		last = cur;
  	}
+ 	snap->nxip = nxip;
+ 	SET_VARSIZE(snap, TXID_SNAPSHOT_SIZE(nxip));
  	PG_RETURN_POINTER(snap);
  
  bad_format:
diff --git a/src/test/regress/expected/txid.out b/src/test/regress/expected/txid.out
index 930b86a..7750b7b 100644
*** a/src/test/regress/expected/txid.out
--- b/src/test/regress/expected/txid.out
*************** select '12:18:14,16'::txid_snapshot;
*** 12,17 ****
--- 12,23 ----
   12:18:14,16
  (1 row)
  
+ select '12:16:14,14'::txid_snapshot;
+  txid_snapshot 
+ ---------------
+  12:16:14
+ (1 row)
+ 
  -- errors
  select '31:12:'::txid_snapshot;
  ERROR:  invalid input for txid_snapshot: "31:12:"
*************** select '12:16:14,13'::txid_snapshot;
*** 29,38 ****
  ERROR:  invalid input for txid_snapshot: "12:16:14,13"
  LINE 1: select '12:16:14,13'::txid_snapshot;
                 ^
- select '12:16:14,14'::txid_snapshot;
- ERROR:  invalid input for txid_snapshot: "12:16:14,14"
- LINE 1: select '12:16:14,14'::txid_snapshot;
-                ^
  create temp table snapshot_test (
  	nr	integer,
  	snap	txid_snapshot
--- 35,40 ----
diff --git a/src/test/regress/sql/txid.sql b/src/test/regress/sql/txid.sql
index ecae10e..b6650b9 100644
*** a/src/test/regress/sql/txid.sql
--- b/src/test/regress/sql/txid.sql
***************
*** 3,15 ****
  -- i/o
  select '12:13:'::txid_snapshot;
  select '12:18:14,16'::txid_snapshot;
  
  -- errors
  select '31:12:'::txid_snapshot;
  select '0:1:'::txid_snapshot;
  select '12:13:0'::txid_snapshot;
  select '12:16:14,13'::txid_snapshot;
- select '12:16:14,14'::txid_snapshot;
  
  create temp table snapshot_test (
  	nr	integer,
--- 3,15 ----
  -- i/o
  select '12:13:'::txid_snapshot;
  select '12:18:14,16'::txid_snapshot;
+ select '12:16:14,14'::txid_snapshot;
  
  -- errors
  select '31:12:'::txid_snapshot;
  select '0:1:'::txid_snapshot;
  select '12:13:0'::txid_snapshot;
  select '12:16:14,13'::txid_snapshot;
  
  create temp table snapshot_test (
  	nr	integer,
#17Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#15)
1 attachment(s)
Re: Race condition between PREPARE TRANSACTION and COMMIT PREPARED (was Re: Problem with txid_snapshot_in/out() functionality)

On 04/13/2014 11:39 PM, Heikki Linnakangas wrote:

However, I just noticed that there's a race condition between PREPARE
TRANSACTION and COMMIT/ROLLBACK PREPARED. PostPrepare_Locks runs after
the prepared transaction is already marked as fully prepared. That means
that by the time we get to PostPrepare_Locks, another backend might
already have finished and removed the prepared transaction. That leads
to a PANIC (put a breakpoint just before PostPrepare_Locks):

postgres=# commit prepared 'foo';
PANIC: failed to re-find shared proclock object
PANIC: failed to re-find shared proclock object
The connection to the server was lost. Attempting reset: Failed.

FinishPrepareTransaction reads the list of locks from the two-phase
state file, but PANICs when it doesn't find the corresponding locks in
the lock manager (because PostPrepare_Locks hasn't transfered them to
the dummy PGPROC yet).

I think we'll need to transfer of the locks earlier, before the
transaction is marked as fully prepared. I'll take a closer look at this
tomorrow.

Here's a patch to do that. It's very straightforward, I just moved the
calls to transfer locks earlier, before ProcArrayClearTransaction.
PostPrepare_MultiXact had a similar race - it also transfer state from
the old PGPROC entry to the new, and needs to be done before allowing
another backend to remove the new PGPROC entry. I changed the names of
the functions to distinguish them from the other PostPrepare_* functions
that now happen at a different time.

The patch is simple, but it's a bit scary to change the order of things
like this. Looking at all the calls that now happen after transferring
the locks, I believe this is OK. The change also applies to the
callbacks called by the RegisterXactCallback mechanism, which means that
in theory there might be a 3rd party extension out there that's
affected. All the callbacks in contrib and plpgsql are OK, and it's
questionable to do anything complicated that would depend on
heavy-weight locks to be held in those callbacks, so I think this is OK.
Warrants a note in the release notes, though.

- Heikki

Attachments:

fix-prepare-transaction-race-1.patchtext/x-diff; name=fix-prepare-transaction-race-1.patchDownload
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index d4ad678..b505c62 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1580,11 +1580,12 @@ AtPrepare_MultiXact(void)
 }
 
 /*
- * PostPrepare_MultiXact
- *		Clean up after successful PREPARE TRANSACTION
+ * TransferPrepare_MultiXact
+ *		Called after successful PREPARE TRANSACTION, before releasing our
+ *		PGPROC entry.
  */
 void
-PostPrepare_MultiXact(TransactionId xid)
+TransferPrepare_MultiXact(TransactionId xid)
 {
 	MultiXactId myOldestMember;
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b20d973..c60edf1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2224,6 +2224,16 @@ PrepareTransaction(void)
 	XactLastRecEnd = 0;
 
 	/*
+	 * Transfer any heavy-weight locks we're holding to the dummy ProcArray.
+	 *
+	 * NB: this has to be done before clearing our own ProcArray entry.
+	 * This is different from normal commit, where locks are released after
+	 * clearing the ProcArray entry!
+	 */
+	TransferPrepare_MultiXact(xid);	  /* also transfer our multixact state */
+	TransferPrepare_Locks(xid);
+
+	/*
 	 * Let others know about no transaction in progress by me.	This has to be
 	 * done *after* the prepared transaction has been marked valid, else
 	 * someone may think it is unlocked and recyclable.
@@ -2234,6 +2244,10 @@ PrepareTransaction(void)
 	 * This is all post-transaction cleanup.  Note that if an error is raised
 	 * here, it's too late to abort the transaction.  This should be just
 	 * noncritical resource releasing.	See notes in CommitTransaction.
+	 *
+	 * NB: we already transfered the locks to the prepared ProcArray entry,
+	 * so even the cleanup before ResourceOwnerRelease(RESOURCE_RELEASE_LOCKS)
+	 * cannot rely on any heavy-weight locks being held!
 	 */
 
 	CallXactCallbacks(XACT_EVENT_PREPARE);
@@ -2256,11 +2270,12 @@ PrepareTransaction(void)
 
 	PostPrepare_smgr();
 
-	PostPrepare_MultiXact(xid);
-
-	PostPrepare_Locks(xid);
 	PostPrepare_PredicateLocks(xid);
 
+	/*
+	 * we're not actually holding any locks anymore, but clean up any other
+	 * resources that might need to be cleaned up at this stage.
+	 */
 	ResourceOwnerRelease(TopTransactionResourceOwner,
 						 RESOURCE_RELEASE_LOCKS,
 						 true, true);
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 6825063..779f0cb 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3069,8 +3069,8 @@ AtPrepare_Locks(void)
 }
 
 /*
- * PostPrepare_Locks
- *		Clean up after successful PREPARE
+ * TransferPrepare_Locks
+ *		Transfer locks to prepared transaction after successful PREPARE
  *
  * Here, we want to transfer ownership of our locks to a dummy PGPROC
  * that's now associated with the prepared transaction, and we want to
@@ -3084,7 +3084,7 @@ AtPrepare_Locks(void)
  * but that probably costs more cycles.
  */
 void
-PostPrepare_Locks(TransactionId xid)
+TransferPrepare_Locks(TransactionId xid)
 {
 	PGPROC	   *newproc = TwoPhaseGetDummyProc(xid);
 	HASH_SEQ_STATUS status;
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 9729f27..cd40a35 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -96,7 +96,7 @@ extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
 
 extern void AtEOXact_MultiXact(void);
 extern void AtPrepare_MultiXact(void);
-extern void PostPrepare_MultiXact(TransactionId xid);
+extern void TransferPrepare_MultiXact(TransactionId xid);
 
 extern Size MultiXactShmemSize(void);
 extern void MultiXactShmemInit(void);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index ceeab9f..2639faf 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -516,7 +516,7 @@ extern bool LockHasWaiters(const LOCKTAG *locktag,
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
 				 LOCKMODE lockmode);
 extern void AtPrepare_Locks(void);
-extern void PostPrepare_Locks(TransactionId xid);
+extern void TransferPrepare_Locks(TransactionId xid);
 extern int LockCheckConflicts(LockMethod lockMethodTable,
 				   LOCKMODE lockmode,
 				   LOCK *lock, PROCLOCK *proclock);
#18Marko Kreen
markokr@gmail.com
In reply to: Jan Wieck (#16)
Re: Problem with txid_snapshot_in/out() functionality

On Sun, Apr 13, 2014 at 05:46:20PM -0400, Jan Wieck wrote:

On 04/13/14 14:22, Jan Wieck wrote:

On 04/13/14 08:27, Marko Kreen wrote:

I think you need to do SET_VARSIZE also here. Alternative is to
move SET_VARSIZE after sort_snapshot().

And it seems the drop-double-txid logic should be added also to
txid_snapshot_recv(). It seems weird to have it behave differently
from txid_snapshot_in().

Thanks,

yes on both issues. Will create another patch.

New patch attached.

New github commit is https://github.com/wieck/postgres/commit/b8fd0d2eb78791e5171e34aecd233fd06218890d

Looks OK to me.

--
marko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#7)
Re: Problem with txid_snapshot_in/out() functionality

On 04/12/2014 05:03 PM, Andres Freund wrote:

On 2014-04-12 09:47:24 -0400, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 04/12/2014 12:07 AM, Jan Wieck wrote:

the Slony team has been getting seldom reports of a problem with the
txid_snapshot data type.
The symptom is that a txid_snapshot on output lists the same txid
multiple times in the xip list part of the external representation.

It's two-phase commit. When preparing a transaction, the state of the
transaction is first transfered to a dummy PGXACT entry, and then the
PGXACT entry of the backend is cleared. There is a transient state when
both PGXACT entries have the same xid.

Hm, yeah, but why is that intermediate state visible to anyone else?
Don't we have exclusive lock on the PGPROC array while we're doing this?

It's done outside the remit of ProcArray lock :(. And documented to lead
to duplicate xids in PGXACT.
EndPrepare():
/*
* Mark the prepared transaction as valid. As soon as xact.c marks
* MyPgXact as not running our XID (which it will do immediately after
* this function returns), others can commit/rollback the xact.
*
* NB: a side effect of this is to make a dummy ProcArray entry for the
* prepared XID. This must happen before we clear the XID from MyPgXact,
* else there is a window where the XID is not running according to
* TransactionIdIsInProgress, and onlookers would be entitled to assume
* the xact crashed. Instead we have a window where the same XID appears
* twice in ProcArray, which is OK.
*/
MarkAsPrepared(gxact);

It doesn't sound too hard to essentially move PrepareTransaction()'s
ProcArrayClearTransaction() into MarkAsPrepared() and rejigger the
locking to remove the intermediate state. But I think it'll lead to
bigger changes than we'd be comfortable backpatching.

Hmm. There's a field in GlobalTransactionData called locking_xid, which
is used to mark the XID of the transaction that's currently operating on
the prepared transaction. At prepare, that ensures that the transaction
cannot be committed or rolled back by another backend until the original
backend has cleared its PGPROC entry. At COMMIT/ROLLBACK PREPARED, it
ensures that only one backend can commit/rollback the transaction.

I wonder why we don't use a VirtualTransactionId there. AFAICS there is
no reason for COMMIT/ROLLBACK PREPARED to be assigned an XID of its own.
And if we used a VirtualTransactionId there, prepare could clear the xid
field of the PGPROC entry earlier.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#19)
Re: Problem with txid_snapshot_in/out() functionality

On 2014-04-14 12:15:30 +0300, Heikki Linnakangas wrote:

Hmm. There's a field in GlobalTransactionData called locking_xid, which is
used to mark the XID of the transaction that's currently operating on the
prepared transaction. At prepare, that ensures that the transaction cannot
be committed or rolled back by another backend until the original backend
has cleared its PGPROC entry. At COMMIT/ROLLBACK PREPARED, it ensures that
only one backend can commit/rollback the transaction.

I wonder why we don't use a VirtualTransactionId there.

I wondered about that previously as well. My bet it's because the 2pc
support arrived before the virtualxact stuff...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#17)
Re: Race condition between PREPARE TRANSACTION and COMMIT PREPARED (was Re: Problem with txid_snapshot_in/out() functionality)

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

I think we'll need to transfer of the locks earlier, before the
transaction is marked as fully prepared. I'll take a closer look at this
tomorrow.

Here's a patch to do that. It's very straightforward, I just moved the
calls to transfer locks earlier, before ProcArrayClearTransaction.
PostPrepare_MultiXact had a similar race - it also transfer state from
the old PGPROC entry to the new, and needs to be done before allowing
another backend to remove the new PGPROC entry. I changed the names of
the functions to distinguish them from the other PostPrepare_* functions
that now happen at a different time.

Why didn't you also move up PostPrepare_PredicateLocks? Seems like its
access to MySerializableXact is also racy.

The patch is simple, but it's a bit scary to change the order of things
like this.

Yeah. There are a lot of assumptions in there about the order of resource
release, in particular that it is safe to do certain things because we're
still holding locks.

I poked around a bit and noticed one theoretical problem sequence: if the
prepared xact drops some relation that we're still holding buffer pins on.
This shouldn't really happen (why are we still pinning some rel we think
we dropped?) but if it did, the commit would do DropRelFileNodeBuffers
which would end up busy-looping until we drop our pins (see
InvalidateBuffer, which thinks this must be an I/O wait situation).
So it would work, more or less, but it seems pretty fragile. I'm afraid
there are more assumptions like this one.

The whole thing feels like we are solving the wrong problem, anyway.
IIUC, the complaint arises because we are allowing COMMIT PREPARED
to occur before the source transaction has reported successful prepare
to its client. Surely that does not need to be a legal case? No
correctly-operating 2PC xact manager would do that.

I'd prefer to leave the prepare sequence alone and instead find a way
to reject COMMIT PREPARED until after the source transaction is safely
clear of the race conditions. The upthread idea of looking at vxid
instead of xid might help, except that I see we clear both of them
in ProcArrayClearTransaction. We'd need some state in PGPROC that
isn't cleared till later than that.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#21)
Re: Race condition between PREPARE TRANSACTION and COMMIT PREPARED (was Re: Problem with txid_snapshot_in/out() functionality)

On 2014-04-14 12:51:02 -0400, Tom Lane wrote:

The whole thing feels like we are solving the wrong problem, anyway.
IIUC, the complaint arises because we are allowing COMMIT PREPARED
to occur before the source transaction has reported successful prepare
to its client. Surely that does not need to be a legal case? No
correctly-operating 2PC xact manager would do that.

I agree here. This seems somewhat risky, just to support a case that
shouldn't happen in reality - as somewhat evidenced by the fact that
there don't seem to be field reports around this.

The upthread idea of looking at vxid
instead of xid might help, except that I see we clear both of them
in ProcArrayClearTransaction. We'd need some state in PGPROC that
isn't cleared till later than that.

I wonder if the most natural way to express this wouldn't be to have a
heavyweight lock for every 2pc xact
'slot'. ResourceOwnerRelease(RESOURCE_RELEASE_LOCKS) should be scheduled
correctly to make error handling for this work.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#22)
Re: Race condition between PREPARE TRANSACTION and COMMIT PREPARED (was Re: Problem with txid_snapshot_in/out() functionality)

Andres Freund <andres@2ndquadrant.com> writes:

I wonder if the most natural way to express this wouldn't be to have a
heavyweight lock for every 2pc xact
'slot'. ResourceOwnerRelease(RESOURCE_RELEASE_LOCKS) should be scheduled
correctly to make error handling for this work.

That seems like not a bad idea. Could we also use the same lock to
prevent concurrent attempts to commit/rollback the same already-prepared
transaction? I forget what we're doing to forestall such cases right now.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#23)
Re: Race condition between PREPARE TRANSACTION and COMMIT PREPARED (was Re: Problem with txid_snapshot_in/out() functionality)

On 2014-04-14 13:47:35 -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

I wonder if the most natural way to express this wouldn't be to have a
heavyweight lock for every 2pc xact
'slot'. ResourceOwnerRelease(RESOURCE_RELEASE_LOCKS) should be scheduled
correctly to make error handling for this work.

That seems like not a bad idea. Could we also use the same lock to
prevent concurrent attempts to commit/rollback the same already-prepared
transaction? I forget what we're doing to forestall such cases right now.

GlobalTransaction->locking_xid is currently used. If it points to a live
transaction by another backned "prepared transaction with identifier
\"%s\" is busy" will be thrown.
ISTM if there were using a lock for every slot, that logic couldbe
thrown away.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#21)
Re: Race condition between PREPARE TRANSACTION and COMMIT PREPARED (was Re: Problem with txid_snapshot_in/out() functionality)

On 04/14/2014 07:51 PM, Tom Lane wrote:

I'd prefer to leave the prepare sequence alone and instead find a way
to reject COMMIT PREPARED until after the source transaction is safely
clear of the race conditions. The upthread idea of looking at vxid
instead of xid might help, except that I see we clear both of them
in ProcArrayClearTransaction. We'd need some state in PGPROC that
isn't cleared till later than that.

Hmm. What if one of the post-cleanup action fails? We can't bail out of
the prepare sequence until we have transfered the locks to the new
PGPROC. Otherwise the locks are lost. In essence, there should be a
critical section from the EndPrepare call until all the critical cleanup
actions like PostPrepare_Locks have been done, and I don't think we want
that. We might be able to guarantee that the built-in post-cleanup
operations are safe enough for that, but there's also CallXactCallbacks
in there.

Given the lack of reports of that happening, though, perhaps that's not
an issue.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#25)
1 attachment(s)
Re: Race condition between PREPARE TRANSACTION and COMMIT PREPARED (was Re: Problem with txid_snapshot_in/out() functionality)

On 04/14/2014 09:48 PM, Heikki Linnakangas wrote:

On 04/14/2014 07:51 PM, Tom Lane wrote:

I'd prefer to leave the prepare sequence alone and instead find a way
to reject COMMIT PREPARED until after the source transaction is safely
clear of the race conditions. The upthread idea of looking at vxid
instead of xid might help, except that I see we clear both of them
in ProcArrayClearTransaction. We'd need some state in PGPROC that
isn't cleared till later than that.

Hmm. What if one of the post-cleanup action fails? We can't bail out of
the prepare sequence until we have transfered the locks to the new
PGPROC. Otherwise the locks are lost. In essence, there should be a
critical section from the EndPrepare call until all the critical cleanup
actions like PostPrepare_Locks have been done, and I don't think we want
that. We might be able to guarantee that the built-in post-cleanup
operations are safe enough for that, but there's also CallXactCallbacks
in there.

Given the lack of reports of that happening, though, perhaps that's not
an issue.

I came up with the attached fix for this. Currently, the entry is
implicitly considered dead or unlocked if the locking_xid transaction is
no longer active, but this patch essentially turns locking_xid into a
simple boolean, and makes it the backend's responsibility to clear it on
abort. (it's not actually a boolean, it's a BackendId, but that's just
for debugging purposes to track who's keeping the entry locked). This
requires a process exit hook, and an abort hook, to make sure the entry
is always released, but that's not too difficult. It allows the backend
to release the entry at exactly the right time, instead of having it
implicitly released by ProcArrayClearTransaction.

If we error during prepare, after having written the prepare WAL record
but before the locks have been transfered to the dummy PGPROC, the locks
are simply released. This is wrong, but it's always been like that and
we haven't heard any complaints of that from the field, so I'm inclined
to leave it as it is. We could use a critical section to force a panic,
but that cure could be a worse than the disease.

I considered Andres' idea of using a new heavy-weight lock, but didn't
like it much. It would be a larger patch, which is not nice for
back-patching. One issue would be that if you run out of lock memory,
you could not roll back any prepared transactions, which is not nice
because it could be a prepared transaction that's hoarding the lock memory.

- Heikki

Attachments:

fix-twophase-race-2.patchtext/x-diff; name=fix-twophase-race-2.patchDownload
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 66dbf58..995f51f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -58,6 +58,7 @@
 #include "replication/walsender.h"
 #include "replication/syncrep.h"
 #include "storage/fd.h"
+#include "storage/ipc.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -82,25 +83,25 @@ int			max_prepared_xacts = 0;
  *
  * The lifecycle of a global transaction is:
  *
- * 1. After checking that the requested GID is not in use, set up an
- * entry in the TwoPhaseState->prepXacts array with the correct XID and GID,
- * with locking_xid = my own XID and valid = false.
+ * 1. After checking that the requested GID is not in use, set up an entry in
+ * the TwoPhaseState->prepXacts array with the correct GID and valid = false,
+ * and mark it as locked by my backend.
  *
  * 2. After successfully completing prepare, set valid = true and enter the
  * referenced PGPROC into the global ProcArray.
  *
- * 3. To begin COMMIT PREPARED or ROLLBACK PREPARED, check that the entry
- * is valid and its locking_xid is no longer active, then store my current
- * XID into locking_xid.  This prevents concurrent attempts to commit or
- * rollback the same prepared xact.
+ * 3. To begin COMMIT PREPARED or ROLLBACK PREPARED, check that the entry is
+ * valid and not locked, then mark the entry as locked by storing my current
+ * backend ID into locking_backend.  This prevents concurrent attempts to
+ * commit or rollback the same prepared xact.
  *
  * 4. On completion of COMMIT PREPARED or ROLLBACK PREPARED, remove the entry
  * from the ProcArray and the TwoPhaseState->prepXacts array and return it to
  * the freelist.
  *
  * Note that if the preparing transaction fails between steps 1 and 2, the
- * entry will remain in prepXacts until recycled.  We can detect recyclable
- * entries by checking for valid = false and locking_xid no longer active.
+ * entry must be removed so that the GID and the GlobalTransaction struct
+ * can be reused.  See AtAbort_Twophase().
  *
  * typedef struct GlobalTransactionData *GlobalTransaction appears in
  * twophase.h
@@ -115,8 +116,8 @@ typedef struct GlobalTransactionData
 	TimestampTz prepared_at;	/* time of preparation */
 	XLogRecPtr	prepare_lsn;	/* XLOG offset of prepare record */
 	Oid			owner;			/* ID of user that executed the xact */
-	TransactionId locking_xid;	/* top-level XID of backend working on xact */
-	bool		valid;			/* TRUE if fully prepared */
+	BackendId	locking_backend; /* backend currently working on the xact */
+	bool		valid;			/* TRUE if PGPROC entry is in proc array */
 	char		gid[GIDSIZE];	/* The GID assigned to the prepared xact */
 }	GlobalTransactionData;
 
@@ -141,6 +142,12 @@ typedef struct TwoPhaseStateData
 
 static TwoPhaseStateData *TwoPhaseState;
 
+/*
+ * Global transaction entry currently locked by us, if any.
+ */
+static GlobalTransaction MyLockedGxact = NULL;
+
+static bool twophaseExitRegistered = false;
 
 static void RecordTransactionCommitPrepared(TransactionId xid,
 								int nchildren,
@@ -157,6 +164,7 @@ static void RecordTransactionAbortPrepared(TransactionId xid,
 							   RelFileNode *rels);
 static void ProcessRecords(char *bufptr, TransactionId xid,
 			   const TwoPhaseCallback callbacks[]);
+static void RemoveGXact(GlobalTransaction gxact);
 
 
 /*
@@ -230,6 +238,66 @@ TwoPhaseShmemInit(void)
 		Assert(found);
 }
 
+/*
+ * Exit hook to unlock the global transaction entry we're working on.
+ */
+static void
+AtProcExit_Twophase(int code, Datum arg)
+{
+	/* same logic as abort */
+	AtAbort_Twophase();
+}
+
+/*
+ * Abort hook to unlock the global transaction entry we're working on.
+ */
+void
+AtAbort_Twophase(void)
+{
+	if (MyLockedGxact == NULL)
+		return;
+
+	/*
+	 * If we were in process of preparing the transaction, but haven't
+	 * written the WAL record yet, remove the global transaction entry.
+	 * Same if we are in the process of finishing an already-prepared
+	 * transaction, and fail after having already written the WAL 2nd
+	 * phase commit or rollback record.
+	 *
+	 * After that it's too late to abort, so just unlock the GlobalTransaction
+	 * entry.  We might not have transfered all locks and other state to the
+	 * prepared transaction yet, so this is a bit bogus, but it's the best we
+	 * can do.
+	 */
+	if (!MyLockedGxact->valid)
+	{
+		RemoveGXact(MyLockedGxact);
+	}
+	else
+	{
+		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+		MyLockedGxact->locking_backend = InvalidBackendId;
+
+		LWLockRelease(TwoPhaseStateLock);
+	}
+	MyLockedGxact = NULL;
+}
+
+/*
+ * This is called after we have finished transfering state to the prepared
+ * PGXACT entry.
+ */
+void
+PostPrepare_Twophase()
+{
+	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+	MyLockedGxact->locking_backend = InvalidBackendId;
+	LWLockRelease(TwoPhaseStateLock);
+
+	MyLockedGxact = NULL;
+}
+
 
 /*
  * MarkAsPreparing
@@ -261,29 +329,15 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 				 errmsg("prepared transactions are disabled"),
 			  errhint("Set max_prepared_transactions to a nonzero value.")));
 
-	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-
-	/*
-	 * First, find and recycle any gxacts that failed during prepare. We do
-	 * this partly to ensure we don't mistakenly say their GIDs are still
-	 * reserved, and partly so we don't fail on out-of-slots unnecessarily.
-	 */
-	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	/* on first call, register the exit hook */
+	if (!twophaseExitRegistered)
 	{
-		gxact = TwoPhaseState->prepXacts[i];
-		if (!gxact->valid && !TransactionIdIsActive(gxact->locking_xid))
-		{
-			/* It's dead Jim ... remove from the active array */
-			TwoPhaseState->numPrepXacts--;
-			TwoPhaseState->prepXacts[i] = TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts];
-			/* and put it back in the freelist */
-			gxact->next = TwoPhaseState->freeGXacts;
-			TwoPhaseState->freeGXacts = gxact;
-			/* Back up index count too, so we don't miss scanning one */
-			i--;
-		}
+		before_shmem_exit(AtProcExit_Twophase, 0);
+		twophaseExitRegistered = true;
 	}
 
+	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
 	/* Check for conflicting GID */
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
@@ -340,7 +394,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 	/* initialize LSN to 0 (start of WAL) */
 	gxact->prepare_lsn = 0;
 	gxact->owner = owner;
-	gxact->locking_xid = xid;
+	gxact->locking_backend = MyBackendId;
 	gxact->valid = false;
 	strcpy(gxact->gid, gid);
 
@@ -348,6 +402,12 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 	Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
 	TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
 
+	/*
+	 * Remember that we have this GlobalTransaction entry locked for us.
+	 * If we abort after this, we must release the entry.
+	 */
+	MyLockedGxact = gxact;
+
 	LWLockRelease(TwoPhaseStateLock);
 
 	return gxact;
@@ -410,6 +470,13 @@ LockGXact(const char *gid, Oid user)
 {
 	int			i;
 
+	/* on first call, register the exit hook */
+	if (!twophaseExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_Twophase, 0);
+		twophaseExitRegistered = true;
+	}
+
 	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
 
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
@@ -424,15 +491,11 @@ LockGXact(const char *gid, Oid user)
 			continue;
 
 		/* Found it, but has someone else got it locked? */
-		if (TransactionIdIsValid(gxact->locking_xid))
-		{
-			if (TransactionIdIsActive(gxact->locking_xid))
-				ereport(ERROR,
-						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				errmsg("prepared transaction with identifier \"%s\" is busy",
-					   gid)));
-			gxact->locking_xid = InvalidTransactionId;
-		}
+		if (gxact->locking_backend != InvalidBackendId)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared transaction with identifier \"%s\" is busy",
+							gid)));
 
 		if (user != gxact->owner && !superuser_arg(user))
 			ereport(ERROR,
@@ -453,7 +516,8 @@ LockGXact(const char *gid, Oid user)
 					 errhint("Connect to the database where the transaction was prepared to finish it.")));
 
 		/* OK for me to lock it */
-		gxact->locking_xid = GetTopTransactionId();
+		gxact->locking_backend = MyBackendId;
+		MyLockedGxact = gxact;
 
 		LWLockRelease(TwoPhaseStateLock);
 
@@ -1089,6 +1153,13 @@ EndPrepare(GlobalTransaction gxact)
 	 */
 	MyPgXact->delayChkpt = false;
 
+	/*
+	 * Remember that we have this GlobalTransaction entry locked for us.  If
+	 * we crash after this point, it's too late to abort, but we must unlock
+	 * it so that the prepared transaction can be committed or rolled back.
+	 */
+	MyLockedGxact = gxact;
+
 	END_CRIT_SECTION();
 
 	/*
@@ -1335,8 +1406,9 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	/*
 	 * In case we fail while running the callbacks, mark the gxact invalid so
-	 * no one else will try to commit/rollback, and so it can be recycled
-	 * properly later.	It is still locked by our XID so it won't go away yet.
+	 * no one else will try to commit/rollback, and so it will be recycled
+	 * if we fail after this point.  It is still locked by our backend so it
+	 * won't go away yet.
 	 *
 	 * (We assume it's safe to do this without taking TwoPhaseStateLock.)
 	 */
@@ -1396,6 +1468,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	RemoveTwoPhaseFile(xid, true);
 
 	RemoveGXact(gxact);
+	MyLockedGxact = NULL;
 
 	pfree(buf);
 }
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9ee11f3..ea8035f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2268,6 +2268,8 @@ PrepareTransaction(void)
 						 RESOURCE_RELEASE_AFTER_LOCKS,
 						 true, true);
 
+	PostPrepare_Twophase();
+
 	/* Check we've released all catcache entries */
 	AtEOXact_CatCache(true);
 
@@ -2394,6 +2396,7 @@ AbortTransaction(void)
 	AtEOXact_LargeObject(false);
 	AtAbort_Notify();
 	AtEOXact_RelationMap(false);
+	AtAbort_Twophase();
 
 	/*
 	 * Advertise the fact that we aborted in pg_clog (assuming that we got as
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 9d29e35..80079b2 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -30,6 +30,9 @@ extern int	max_prepared_xacts;
 extern Size TwoPhaseShmemSize(void);
 extern void TwoPhaseShmemInit(void);
 
+extern void AtAbort_Twophase(void);
+extern void PostPrepare_Twophase(void);
+
 extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid);
 extern BackendId TwoPhaseGetDummyBackendId(TransactionId xid);
 
#27Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#26)
Re: Race condition between PREPARE TRANSACTION and COMMIT PREPARED (was Re: Problem with txid_snapshot_in/out() functionality)

Hi,

On 2014-05-05 13:41:00 +0300, Heikki Linnakangas wrote:

I came up with the attached fix for this. Currently, the entry is implicitly
considered dead or unlocked if the locking_xid transaction is no longer
active, but this patch essentially turns locking_xid into a simple boolean,
and makes it the backend's responsibility to clear it on abort. (it's not
actually a boolean, it's a BackendId, but that's just for debugging purposes
to track who's keeping the entry locked). This requires a process exit hook,
and an abort hook, to make sure the entry is always released, but that's not
too difficult. It allows the backend to release the entry at exactly the
right time, instead of having it implicitly released by

I considered Andres' idea of using a new heavy-weight lock, but didn't like
it much. It would be a larger patch, which is not nice for back-patching.
One issue would be that if you run out of lock memory, you could not roll
back any prepared transactions, which is not nice because it could be a
prepared transaction that's hoarding the lock memory.

I am not convinced by the latter reasoning but you're right that any
such change would hardly be backpatchable.

+/*
+ * Exit hook to unlock the global transaction entry we're working on.
+ */
+static void
+AtProcExit_Twophase(int code, Datum arg)
+{
+	/* same logic as abort */
+	AtAbort_Twophase();
+}
+
+/*
+ * Abort hook to unlock the global transaction entry we're working on.
+ */
+void
+AtAbort_Twophase(void)
+{
+	if (MyLockedGxact == NULL)
+		return;
+
+	/*
+	 * If we were in process of preparing the transaction, but haven't
+	 * written the WAL record yet, remove the global transaction entry.
+	 * Same if we are in the process of finishing an already-prepared
+	 * transaction, and fail after having already written the WAL 2nd
+	 * phase commit or rollback record.
+	 *
+	 * After that it's too late to abort, so just unlock the GlobalTransaction
+	 * entry.  We might not have transfered all locks and other state to the
+	 * prepared transaction yet, so this is a bit bogus, but it's the best we
+	 * can do.
+	 */
+	if (!MyLockedGxact->valid)
+	{
+		RemoveGXact(MyLockedGxact);
+	}
+	else
+	{
+		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+		MyLockedGxact->locking_backend = InvalidBackendId;
+
+		LWLockRelease(TwoPhaseStateLock);
+	}
+	MyLockedGxact = NULL;
+}

Is it guaranteed that all paths have called LWLockReleaseAll()
before calling the proc exit hooks? Otherwise we might end up waiting
for ourselves...

/*
* MarkAsPreparing
@@ -261,29 +329,15 @@ MarkAsPreparing(TransactionId xid, const char *gid,
errmsg("prepared transactions are disabled"),
errhint("Set max_prepared_transactions to a nonzero value.")));

-	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-
-	/*
-	 * First, find and recycle any gxacts that failed during prepare. We do
-	 * this partly to ensure we don't mistakenly say their GIDs are still
-	 * reserved, and partly so we don't fail on out-of-slots unnecessarily.
-	 */
-	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	/* on first call, register the exit hook */
+	if (!twophaseExitRegistered)
{
-		gxact = TwoPhaseState->prepXacts[i];
-		if (!gxact->valid && !TransactionIdIsActive(gxact->locking_xid))
-		{
-			/* It's dead Jim ... remove from the active array */
-			TwoPhaseState->numPrepXacts--;
-			TwoPhaseState->prepXacts[i] = TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts];
-			/* and put it back in the freelist */
-			gxact->next = TwoPhaseState->freeGXacts;
-			TwoPhaseState->freeGXacts = gxact;
-			/* Back up index count too, so we don't miss scanning one */
-			i--;
-		}
+		before_shmem_exit(AtProcExit_Twophase, 0);
+		twophaseExitRegistered = true;
}

It's not particularly nice to register shmem exit hooks in the middle of
normal processing because it makes it impossible to use
cancel_before_shmem_exit() previously registered hooks. I think this
should be registered at startup, if max_prepared_xacts > 0.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#28Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#27)
Re: Race condition between PREPARE TRANSACTION and COMMIT PREPARED (was Re: Problem with txid_snapshot_in/out() functionality)

On 05/06/2014 02:44 PM, Andres Freund wrote:

On 2014-05-05 13:41:00 +0300, Heikki Linnakangas wrote:

+/*
+ * Exit hook to unlock the global transaction entry we're working on.
+ */
+static void
+AtProcExit_Twophase(int code, Datum arg)
+{
+	/* same logic as abort */
+	AtAbort_Twophase();
+}
+
+/*
+ * Abort hook to unlock the global transaction entry we're working on.
+ */
+void
+AtAbort_Twophase(void)
+{
+	if (MyLockedGxact == NULL)
+		return;
+
+	/*
+	 * If we were in process of preparing the transaction, but haven't
+	 * written the WAL record yet, remove the global transaction entry.
+	 * Same if we are in the process of finishing an already-prepared
+	 * transaction, and fail after having already written the WAL 2nd
+	 * phase commit or rollback record.
+	 *
+	 * After that it's too late to abort, so just unlock the GlobalTransaction
+	 * entry.  We might not have transfered all locks and other state to the
+	 * prepared transaction yet, so this is a bit bogus, but it's the best we
+	 * can do.
+	 */
+	if (!MyLockedGxact->valid)
+	{
+		RemoveGXact(MyLockedGxact);
+	}
+	else
+	{
+		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+		MyLockedGxact->locking_backend = InvalidBackendId;
+
+		LWLockRelease(TwoPhaseStateLock);
+	}
+	MyLockedGxact = NULL;
+}

Is it guaranteed that all paths have called LWLockReleaseAll()
before calling the proc exit hooks? Otherwise we might end up waiting
for ourselves...

Hmm. AbortTransaction() will release locks before we get here, but the
before_shmem_exit() callpath will not. So an elog(FATAL), while holding
TwoPhaseStateLock would cause us to deadlock with ourself. But there are
no such elogs.

I copied this design from async.c, which is quite similar, so if there's
a problem that ought to be fixed too. And there are other more
complicated before_shmem callbacks that worry me more, like
createdb_failure_callback(). But I think they're all all right.

/*
* MarkAsPreparing
@@ -261,29 +329,15 @@ MarkAsPreparing(TransactionId xid, const char *gid,
errmsg("prepared transactions are disabled"),
errhint("Set max_prepared_transactions to a nonzero value.")));

-	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
-
-	/*
-	 * First, find and recycle any gxacts that failed during prepare. We do
-	 * this partly to ensure we don't mistakenly say their GIDs are still
-	 * reserved, and partly so we don't fail on out-of-slots unnecessarily.
-	 */
-	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	/* on first call, register the exit hook */
+	if (!twophaseExitRegistered)
{
-		gxact = TwoPhaseState->prepXacts[i];
-		if (!gxact->valid && !TransactionIdIsActive(gxact->locking_xid))
-		{
-			/* It's dead Jim ... remove from the active array */
-			TwoPhaseState->numPrepXacts--;
-			TwoPhaseState->prepXacts[i] = TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts];
-			/* and put it back in the freelist */
-			gxact->next = TwoPhaseState->freeGXacts;
-			TwoPhaseState->freeGXacts = gxact;
-			/* Back up index count too, so we don't miss scanning one */
-			i--;
-		}
+		before_shmem_exit(AtProcExit_Twophase, 0);
+		twophaseExitRegistered = true;
}

It's not particularly nice to register shmem exit hooks in the middle of
normal processing because it makes it impossible to use
cancel_before_shmem_exit() previously registered hooks. I think this
should be registered at startup, if max_prepared_xacts > 0.

<shrug>. async.c and namespace.c does the same, and it hasn't been a
problem.

I committed this now, but please let me know if you see a concrete
problem with the locks.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#29Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#28)
Re: Race condition between PREPARE TRANSACTION and COMMIT PREPARED (was Re: Problem with txid_snapshot_in/out() functionality)

On 2014-05-15 17:21:28 +0300, Heikki Linnakangas wrote:

Is it guaranteed that all paths have called LWLockReleaseAll()
before calling the proc exit hooks? Otherwise we might end up waiting
for ourselves...

Hmm. AbortTransaction() will release locks before we get here, but the
before_shmem_exit() callpath will not. So an elog(FATAL), while holding
TwoPhaseStateLock would cause us to deadlock with ourself. But there are no
such elogs.

I copied this design from async.c, which is quite similar, so if there's a
problem that ought to be fixed too. And there are other more complicated
before_shmem callbacks that worry me more, like createdb_failure_callback().
But I think they're all all right.

Perhaps we should enforce that LWLockReleaseAll() is called first?
E.g. in shmem_exit()? It'll happen in ProcKill() atm, but that's
normally pretty much at the bottom of the stack.

It's not particularly nice to register shmem exit hooks in the middle of
normal processing because it makes it impossible to use
cancel_before_shmem_exit() previously registered hooks. I think this
should be registered at startup, if max_prepared_xacts > 0.

<shrug>. async.c and namespace.c does the same, and it hasn't been a
problem.

Well, it doesn't seem unreasonable to have C code using
PG_ENSURE_ERROR_CLEANUP/PG_END_ENSURE_ERROR_CLEANUP around a 2pc commit
to me. That'll break with this.
Perhaps we should just finally make cancel_before_shmem_exit search the
stack of callbacks.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Marko Kreen (#18)
Re: Problem with txid_snapshot_in/out() functionality

On 04/14/2014 11:55 AM, Marko Kreen wrote:

On Sun, Apr 13, 2014 at 05:46:20PM -0400, Jan Wieck wrote:

On 04/13/14 14:22, Jan Wieck wrote:

On 04/13/14 08:27, Marko Kreen wrote:

I think you need to do SET_VARSIZE also here. Alternative is to
move SET_VARSIZE after sort_snapshot().

And it seems the drop-double-txid logic should be added also to
txid_snapshot_recv(). It seems weird to have it behave differently
from txid_snapshot_in().

Thanks,

yes on both issues. Will create another patch.

New patch attached.

New github commit is https://github.com/wieck/postgres/commit/b8fd0d2eb78791e5171e34aecd233fd06218890d

Looks OK to me.

Ok, committed.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#29)
Re: Race condition between PREPARE TRANSACTION and COMMIT PREPARED (was Re: Problem with txid_snapshot_in/out() functionality)

On Thu, May 15, 2014 at 10:38 AM, Andres Freund <andres@2ndquadrant.com> wrote:

<shrug>. async.c and namespace.c does the same, and it hasn't been a
problem.

Well, it doesn't seem unreasonable to have C code using
PG_ENSURE_ERROR_CLEANUP/PG_END_ENSURE_ERROR_CLEANUP around a 2pc commit
to me. That'll break with this.
Perhaps we should just finally make cancel_before_shmem_exit search the
stack of callbacks.

Yes, please. And while we're at it, perhaps we should make it Trap()
or fail an Assert() if it doesn't find the callback it was told to
remove.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers