nested transactions

Started by Bruce Momjianover 23 years ago38 messageshackers
Jump to latest
#1Bruce Momjian
bruce@momjian.us

I am going to work on nested transactions for 7.4.

My goal is to first implement nested transactions:

BEGIN;
SELECT ...
BEGIN;
UPDATE;
COMMIT;
DELETE;
COMMIT;

and later savepoints (Oracle):

BEGIN;
SELECT ...
SAVEPOINT t1;
UPDATE;
SAVEPOINT t2;
DELETE;
ROLLBACK TO SAVEPOINT t2;
COMMIT;

I assume people want both.

As an implementation, I will hack xact.c to create a transaction status
stack so when you do a BEGIN inside a transaction, it saves the
transaction status, the transaction block status, and perhaps the
command counter. A COMMIT restores these values.

I also plan to modify the on commit/abort actions. On a subtransaction
commit, little has to be done, but on an ABORT, you must execute any
abort actions required by that subtransaction _and_ remove any on commit
actions for the subtransaction. There will need to be some code
reorganization because some on commit/abort activity assumes only one
transaction can be in process. A stack will need to be added in those
cases.

And finally, I must abort tuple changes made by the aborted
subtransaction. One way of doing that is to keep all relation id's
modified by the transaction, and do a sequential scan of the tables on
abort, changing the transaction id's to a fixed aborted transaction id.
However, this could be slow. (We could store tids if only a few rows
are updated by a subtransaction. That would speed it up considerably.)

Another idea is to use new transaction id's for the subtransactions, and
update the transaction id status in pg_clog for the subtransactions, so
that there is no transaction id renumbering required. One problem with
this is the requirement of updating all the clog transaction id statuses
atomically. One way to do that would be to do parent/child dependency
in clog so that if a child is looked up and it is marked as "in
process", a check could be done against the parent. Once the outer
transaction is committed/aborted, those subtransactions could be updated
so there would be no need to reference the parent any longer. This
would increase the clog size per transaction from 2 bits to 4 bytes
(two bits for status, 30 bits for offset to parent).

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#1)
Re: nested transactions

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I am going to work on nested transactions for 7.4.
[some details]

This is, of course, barely scratching the surface of what will need to
be done.

I assume you've abandoned the notion of a fast release cycle for 7.4?
'Cause if you start on this, we ain't releasing any time soon ...

regards, tom lane

#3Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#2)
Re: nested transactions

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I am going to work on nested transactions for 7.4.
[some details]

This is, of course, barely scratching the surface of what will need to
be done.

I assume you've abandoned the notion of a fast release cycle for 7.4?
'Cause if you start on this, we ain't releasing any time soon ...

Abandoned because of the delay in Win32 (end of Dec), PITR (not being
worked on), and mostly because very few wanted a short release cycle.

I will keep the transaction changes private to my tree, so if I can't
get it done, I will just keep it for the next release.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#4Scott Lamb
slamb@slamb.org
In reply to: Bruce Momjian (#1)
Re: nested transactions

Bruce Momjian wrote:

I am going to work on nested transactions for 7.4.

If you're going to do a lot of reworking of how transactions are
handled, maybe this is a good time to beg for cursors that stay open
across commits. It looks like the JDBC driver is moving to using cursors
with ResultSet.CLOSE_CURSORS_AT_COMMIT, for the advantage of not having
to fetch the entire result immediately and hold it in memory. If this
were implemented, the same could be done for
ResultSet.HOLD_CURSORS_OVER_COMMIT, which I think a lot of JDBC code needs.

Thanks,
Scott

#5snpe
snpe@snpe.co.yu
In reply to: Scott Lamb (#4)
Re: nested transactions

On Friday 22 November 2002 04:36 pm, Scott Lamb wrote:

Bruce Momjian wrote:

I am going to work on nested transactions for 7.4.

If you're going to do a lot of reworking of how transactions are
handled, maybe this is a good time to beg for cursors that stay open
across commits. It looks like the JDBC driver is moving to using cursors
with ResultSet.CLOSE_CURSORS_AT_COMMIT, for the advantage of not having
to fetch the entire result immediately and hold it in memory. If this
were implemented, the same could be done for
ResultSet.HOLD_CURSORS_OVER_COMMIT, which I think a lot of JDBC code needs.

I agree.It is my favorite features - and if you set savepoint I think that stay first solution
(begin; ... ; begin; ...; begin; ...;comit; ...;commit;...; commit;

Thanks
Haris Peco

#6Manfred Koizar
mkoi-pg@aon.at
In reply to: Bruce Momjian (#1)
Re: nested transactions

On Fri, 22 Nov 2002 00:32:46 -0500 (EST), Bruce Momjian
<pgman@candle.pha.pa.us> wrote:

I am going to work on nested transactions for 7.4.
[...]
And finally, I must abort tuple changes made by the aborted
subtransaction. One way of doing that is to keep all relation id's
modified by the transaction, and do a sequential scan of the tables on
abort, changing the transaction id's to a fixed aborted transaction id.
However, this could be slow. (We could store tids if only a few rows
are updated by a subtransaction. That would speed it up considerably.)

Depends on your definition of "few". I don't expect problems for up
to several thousand tids. If there are more modified tuples, we could
first reduce the list to page numbers, before finally falling back to
table scans.

Another idea is to use new transaction id's for the subtransactions, and
[...]
would increase the clog size per transaction from 2 bits to 4 bytes
(two bits for status, 30 bits for offset to parent).

Nice idea, this 30 bit offset. But one could argue that increased
clog size even hurts users who don't use nested transactions at all.
If parent/child dependency is kept separate from status bits (in
pg_subtransxxxx files), additional I/O cost is only paid if
subtransactions are actually used. New status bits (XMIN_IS_SUB,
XMAX_IS_SUB) in tuple headers can avoid unnecessary parent xid
lookups.

I also thought of subtransaction xids in tuple headers as short lived
information. Under certain conditions they can be replaced with the
parent xid as soon as the parent transaction has finished. I proposed
this to be done on the next tuple access just like we set
committed/aborted flags now, though I'm not sure anymore that it is
safe to do this.

Old pg_subtrans files can be removed by VACUUM.

One more difference between the two proposals: The former (locally
remember modified tuples) can be used for recovery after a failed
command. The latter (subtrans tree) can only help, if we give a new
xid to each command, which I'm sure we don't want to do.

Servus
Manfred

#7Ken Hirsch
kahirsch@bellsouth.net
In reply to: Bruce Momjian (#1)
Re: nested transactions

From: "Bruce Momjian" <pgman@candle.pha.pa.us>

And finally, I must abort tuple changes made by the aborted
subtransaction. One way of doing that is to keep all relation id's
modified by the transaction, and do a sequential scan of the tables on
abort, changing the transaction id's to a fixed aborted transaction id.
However, this could be slow. (We could store tids if only a few rows
are updated by a subtransaction. That would speed it up considerably.)

Are you sure you don't want to use the log for this? It does mean that the
log can grow without bound for long-lived transactions, but it's very
straightforward and fast.

Ken Hirsch

#8Bruce Momjian
bruce@momjian.us
In reply to: Manfred Koizar (#6)
Re: nested transactions

Manfred Koizar wrote:

On Fri, 22 Nov 2002 00:32:46 -0500 (EST), Bruce Momjian
<pgman@candle.pha.pa.us> wrote:

I am going to work on nested transactions for 7.4.
[...]
And finally, I must abort tuple changes made by the aborted
subtransaction. One way of doing that is to keep all relation id's
modified by the transaction, and do a sequential scan of the tables on
abort, changing the transaction id's to a fixed aborted transaction id.
However, this could be slow. (We could store tids if only a few rows
are updated by a subtransaction. That would speed it up considerably.)

Depends on your definition of "few". I don't expect problems for up
to several thousand tids. If there are more modified tuples, we could
first reduce the list to page numbers, before finally falling back to
table scans.

Yes, and the key point is that those are kept only in the backend local
memory, so clearly thousands are possible. The outer transaction takes
care of all the ACID issues.

Another idea is to use new transaction id's for the subtransactions, and
[...]
would increase the clog size per transaction from 2 bits to 4 bytes
(two bits for status, 30 bits for offset to parent).

Nice idea, this 30 bit offset. But one could argue that increased
clog size even hurts users who don't use nested transactions at all.
If parent/child dependency is kept separate from status bits (in
pg_subtransxxxx files), additional I/O cost is only paid if
subtransactions are actually used. New status bits (XMIN_IS_SUB,
XMAX_IS_SUB) in tuple headers can avoid unnecessary parent xid
lookups.

I also thought of subtransaction xids in tuple headers as short lived
information. Under certain conditions they can be replaced with the
parent xid as soon as the parent transaction has finished. I proposed
this to be done on the next tuple access just like we set
committed/aborted flags now, though I'm not sure anymore that it is
safe to do this.

Old pg_subtrans files can be removed by VACUUM.

One more difference between the two proposals: The former (locally
remember modified tuples) can be used for recovery after a failed
command. The latter (subtrans tree) can only help, if we give a new
xid to each command, which I'm sure we don't want to do.

The interesting issue is that if we could set the commit/abort bits all
at the same time, we could have the parent/child dependency local to the
backend --- other backends don't need to know the parent, only the
status of the (subtransaction's) xid, and they need to see all those
xid's committed at the same time.

You could store the backend slot id in pg_clog rather than the parent
xid and look up the status of the outer xid for that backend slot. That
would allow you to use 2 bytes, with a max of 16k backends. The problem
is that on a crash, the pg_clog points to invalid slots --- it would
probably have to be cleaned up on startup.

But still, you have an interesting idea of just setting the bit to be "I
am a child". The trick is allowing backends to figure out who's child
you are. We could store this somehow in shared memory, but that is
finite and there can be lots of xid's for a backend using
subtransactions.

I still think there must be a clean way, but I haven't figured it out yet.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#9Bruce Momjian
bruce@momjian.us
In reply to: Ken Hirsch (#7)
Re: nested transactions

Ken Hirsch wrote:

From: "Bruce Momjian" <pgman@candle.pha.pa.us>

And finally, I must abort tuple changes made by the aborted
subtransaction. One way of doing that is to keep all relation id's
modified by the transaction, and do a sequential scan of the tables on
abort, changing the transaction id's to a fixed aborted transaction id.
However, this could be slow. (We could store tids if only a few rows
are updated by a subtransaction. That would speed it up considerably.)

Are you sure you don't want to use the log for this? It does mean that the
log can grow without bound for long-lived transactions, but it's very
straightforward and fast.

I don't think we want to have unlimited log file growth for long running
transactions/subtransactions.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#10Hans-Jürgen Schönig
postgres@cybertec.at
In reply to: Bruce Momjian (#9)
Re: nested transactions

Is there going to be a way to use transactions inside transactions of
transactions?
In other words:

BEGIN;
BEGIN;
BEGIN;
BEGIN;

COMMIT;
COMMIT;
COMMIT;
COMMIT;

Is there a way to have some sort of recursive solution with every
transaction but the first one being a child transaction?
Is there a way to implement that without too much extra effort?
I just curious how that could be done.

Hans

#11Manfred Koizar
mkoi-pg@aon.at
In reply to: Bruce Momjian (#8)
Re: nested transactions

On Wed, 27 Nov 2002 22:47:33 -0500 (EST), Bruce Momjian
<pgman@candle.pha.pa.us> wrote:

The interesting issue is that if we could set the commit/abort bits all
at the same time, we could have the parent/child dependency local to the
backend --- other backends don't need to know the parent, only the
status of the (subtransaction's) xid, and they need to see all those
xid's committed at the same time.

You mean the commit/abort bit in the tuple headers? Yes, this would
be interesting, but I see no way how this could be done. If it could,
there would be no need for pg_clog.

Reading your paragraph above one more time I think you mean the bits
in pg_clog: Each subtransaction gets its own xid. On ROLLBACK the
abort bits of the aborted (sub)transaction and all its children are
set in pg_clog immediately. This operation does not have to be
atomic. On subtransaction COMMIT nothing happens to pg_clog, the
status is only changed locally, the subtransaction still looks "in
progress" to other backends. Only when the main transaction commits,
we set the commit bits of the main transaction and all its non-aborted
children in pg_clog. This action has to be atomic. Right?

AFAICS the problem lies in updating several pg_clog bits at once. How
can this be done without holding a potentially long lasting lock?

You could store the backend slot id in pg_clog rather than the parent
xid and look up the status of the outer xid for that backend slot. That
would allow you to use 2 bytes, with a max of 16k backends. The problem
is that on a crash, the pg_clog points to invalid slots --- it would
probably have to be cleaned up on startup.

Again I would try to keep pg_clog compact and store the backend slots
in another file, thus not slowing down instances where subtransactions
are nor used. Apart from this minor detail I don't see, how this is
supposed to work. Could you elaborate?

But still, you have an interesting idea of just setting the bit to be "I
am a child".

The idea was to set subtransaction bits in the tuple header. Here is
yet another different idea: Let the currently unused fourth state in
pg_clog indicate a committed subtransaction. There are two bits per
transaction, commit and abort, with the following meaning:

a c
0 0 transaction in progress, the owning backend knows whether it is
a main- or a sub-transaction, other backends don't care
1 0 aborted, nobody cares whether main- or sub-transaction
0 1 committed main-transaction (*)
1 1 committed sub-transaction, have to look for parent in
pg_subtrans

If we allow the 1/1 state to be replaced with 0/1 or 1/0 (on the fly
as a side effect of a visibility check, or by vacuum, or by
COMMIT/ROLLBACK), this could save a lot of parent lookups without
having to touch the xids in the tuple headers.

So (*) should read: committed main-transaction or committed
sub-transaction having a committed parent.

The trick is allowing backends to figure out who's child
you are. We could store this somehow in shared memory, but that is
finite and there can be lots of xid's for a backend using
subtransactions.

The subtrans dependencies have to be visible to all backends. Store
them to disk just like pg_clog. In older proposals I spoke of a
pg_subtrans "table" containing (parent, child) pairs. This was only
meant as a concept, not as a real SQL table subject to MVCC. An
efficient(?) implementation could be an array of parent xids, indexed
by child xid. Most of it can be stolen from the clog code.

One more argument for pg_subtrans being visible to all backends: If
an UPDATE is about to change a tuple touched by another active
transaction, it waits for the other transaction to commit or abort.
We must always wait for the main transaction, not the subtrans.

I still think there must be a clean way,

I hope so ...

but I haven't figured it out yet.

Are we getting nearer?

Servus
Manfred

#12Bruce Momjian
bruce@momjian.us
In reply to: Hans-Jürgen Schönig (#10)
Re: nested transactions

Hans-J���rgen Sch���nig wrote:

Is there going to be a way to use transactions inside transactions of
transactions?
In other words:

BEGIN;
BEGIN;
BEGIN;
BEGIN;

COMMIT;
COMMIT;
COMMIT;
COMMIT;

Is there a way to have some sort of recursive solution with every
transaction but the first one being a child transaction?
Is there a way to implement that without too much extra effort?
I just curious how that could be done.

Sure, nesting will be unlimited.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#13Bruce Momjian
bruce@momjian.us
In reply to: Manfred Koizar (#11)
Re: nested transactions

Manfred Koizar wrote:

On Wed, 27 Nov 2002 22:47:33 -0500 (EST), Bruce Momjian
<pgman@candle.pha.pa.us> wrote:

The interesting issue is that if we could set the commit/abort bits all
at the same time, we could have the parent/child dependency local to the
backend --- other backends don't need to know the parent, only the
status of the (subtransaction's) xid, and they need to see all those
xid's committed at the same time.

You mean the commit/abort bit in the tuple headers? Yes, this would
be interesting, but I see no way how this could be done. If it could,
there would be no need for pg_clog.

Reading your paragraph above one more time I think you mean the bits
in pg_clog: Each subtransaction gets its own xid. On ROLLBACK the

Right.

abort bits of the aborted (sub)transaction and all its children are
set in pg_clog immediately. This operation does not have to be
atomic. On subtransaction COMMIT nothing happens to pg_clog, the

Right, going from RUNNING to ABORTED doesn't have to be atomic because
both tuples are invisible.

status is only changed locally, the subtransaction still looks "in
progress" to other backends. Only when the main transaction commits,
we set the commit bits of the main transaction and all its non-aborted
children in pg_clog. This action has to be atomic. Right?

Right. We can't have some backends looking at part of the transaction
as committed while at the same time other backends see the transaction
as in process.

AFAICS the problem lies in updating several pg_clog bits at once. How
can this be done without holding a potentially long lasting lock?

Yes, locking is one possible solution, but no one likes that. One hack
lock idea would be to create a subtransaction-only lock, so if you see
the special 4-th xact state (about to be committed as part of a
subtransaction) you have to wait on that lock (held by the backend
twiddling the xact bits), then look again. That basically would
serialize all the bit-twiddling for subtransactions. I am sure I am
going to get a "yuck" from the audience on that one, but I am not sure
how long that bit twiddling could take. Does xact twiddle every cause
I/O? I think it could, which would be a pretty big performance problem.
It would serialize the subtransaction commits _and_ block anyone trying
to get the status of those subtransactions. We would not use the the
4th xid status during the transaction, only while we were twiddling the
bits on commit.

You could store the backend slot id in pg_clog rather than the parent
xid and look up the status of the outer xid for that backend slot. That
would allow you to use 2 bytes, with a max of 16k backends. The problem
is that on a crash, the pg_clog points to invalid slots --- it would
probably have to be cleaned up on startup.

Again I would try to keep pg_clog compact and store the backend slots
in another file, thus not slowing down instances where subtransactions
are nor used. Apart from this minor detail I don't see, how this is
supposed to work. Could you elaborate?

The trick is that when that 4th status is set, backends looking up the
status all need to point to a central location that can be set for all
of them at once, hence the original idea of putting the parent xid in
the clog file. We don't _need_ to do that, but we do need a way to
_point_ to a central location where the status can be looked up.

But still, you have an interesting idea of just setting the bit to be "I
am a child".

The idea was to set subtransaction bits in the tuple header. Here is
yet another different idea: Let the currently unused fourth state in
pg_clog indicate a committed subtransaction. There are two bits per
transaction, commit and abort, with the following meaning:

a c
0 0 transaction in progress, the owning backend knows whether it is
a main- or a sub-transaction, other backends don't care
1 0 aborted, nobody cares whether main- or sub-transaction
0 1 committed main-transaction (*)
1 1 committed sub-transaction, have to look for parent in
pg_subtrans

If we allow the 1/1 state to be replaced with 0/1 or 1/0 (on the fly
as a side effect of a visibility check, or by vacuum, or by
COMMIT/ROLLBACK), this could save a lot of parent lookups without
having to touch the xids in the tuple headers.

Yes, you could do that, but we can easily just set the clog bits
atomically, and it will not be needed --- the tuple bits really don't
help us, I think.

So (*) should read: committed main-transaction or committed
sub-transaction having a committed parent.

The trick is allowing backends to figure out who's child
you are. We could store this somehow in shared memory, but that is
finite and there can be lots of xid's for a backend using
subtransactions.

The subtrans dependencies have to be visible to all backends. Store
them to disk just like pg_clog. In older proposals I spoke of a
pg_subtrans "table" containing (parent, child) pairs. This was only
meant as a concept, not as a real SQL table subject to MVCC. An
efficient(?) implementation could be an array of parent xids, indexed
by child xid. Most of it can be stolen from the clog code.

OK, we put it in a file. And how do we efficiently clean it up?
Remember, it is only to be used for a _brief_ period of time. I think a
file system solution is doable if we can figure out a way not to create
a file for every xid. Do we spin through the files (one per outer
transaction) looking for a matching xid when we see that 4th xact state?

Maybe we write the xid's to a file in a special directory in sorted
order, and backends can do a btree search of each file in that directory
looking for the xid, and then knowing the master xid, look up that
status, and once all the children xid's are updated, you delete the
file.

One more argument for pg_subtrans being visible to all backends: If
an UPDATE is about to change a tuple touched by another active
transaction, it waits for the other transaction to commit or abort.
We must always wait for the main transaction, not the subtrans.

Yes, but again, the xid status of subtransactions is only update just
before commit of the main transaction, so there is little value to
having those visible.

Let's keep going!

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#14Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#13)
Re: nested transactions

pgman wrote:

AFAICS the problem lies in updating several pg_clog bits at once. How
can this be done without holding a potentially long lasting lock?

Yes, locking is one possible solution, but no one likes that. One hack
lock idea would be to create a subtransaction-only lock, so if you see
the special 4-th xact state (about to be committed as part of a
subtransaction) you have to wait on that lock (held by the backend
twiddling the xact bits), then look again. That basically would
serialize all the bit-twiddling for subtransactions. I am sure I am
going to get a "yuck" from the audience on that one, but I am not sure
how long that bit twiddling could take. Does xact twiddle every cause
I/O? I think it could, which would be a pretty big performance problem.
It would serialize the subtransaction commits _and_ block anyone trying
to get the status of those subtransactions. We would not use the the
4th xid status during the transaction, only while we were twiddling the
bits on commit.

Let me correct this. Transaction state readers _don't_ have to block
while the subtransaction is twiddling bits. Logic would be:

Set all aborted subtransaction status bits
Grab subxact lock
Set subxact global status bit to in progress
Set all subtransaction xids to SUBXACT_TO_COMMIT
Set subxact global status bit to committed (commit happens here)
Set all SUBXACT_TO_COMMIT xids to COMMITTED
Release subxact lock

Any transaction looking up a subtransaction that has an
SUBXACT_TO_COMMIT state has to consult the global subxact status bit,
which is a global variable in shared memory.

What this basically does is to funnel all subxid lookups into a single
global subxid status bit. In fact, even the outer transaction has to be
set to SUBXACT_TO_COMMIT so it commits at the same time as the
subtransactions.

On crash recovery, all SUBXACT_TO_COMMIT have to be cleaned up, somehow,
perhaps using WAL.

The only downside to this approach is that subtransaction bit twiddling
is serialized.

This is an interesting idea becuase it has overhead only to backends
using subtransactions. It does kill our "multiple commits at the same
time" that we can do with normal transactions.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#13)
Re: nested transactions

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Yes, locking is one possible solution, but no one likes that. One hack
lock idea would be to create a subtransaction-only lock, so if you see
the special 4-th xact state (about to be committed as part of a
subtransaction) you have to wait on that lock (held by the backend
twiddling the xact bits), then look again. That basically would
serialize all the bit-twiddling for subtransactions. I am sure I am
going to get a "yuck" from the audience on that one,

You sure are.

but I am not sure
how long that bit twiddling could take. Does xact twiddle every cause
I/O?

Yes, if the page of pg_clog you need to touch is not currently in a
buffer. With a large transaction you might have hundreds of
subtransactions, which could take an unpleasantly long time to mark
all committed.

What's worse, I think the above proposal requires a *single* lock for
this purpose (if there's more than one, how shall the requestor know
which one to block on?) --- so you are serializing all transaction
commits that have subtransactions, with only one able to go through at
a time. That will really, really not do; the performance will be way
worse than the chaining idea we discussed before.

You could store the backend slot id in pg_clog rather than the parent
xid and look up the status of the outer xid for that backend slot. That
would allow you to use 2 bytes, with a max of 16k backends.

This is also a bad idea, because backend slot ids are not stable (by the
time you look in PG_PROC, the slot may be occupied by a new, unrelated
backend process).

But still, you have an interesting idea of just setting the bit to be "I
am a child".

That bit alone doesn't help; you need to know *whose* child.

AFAICS, the objection to putting parent xact IDs into pg_clog is
basically a performance issue: bigger clog means more I/O. This is
surely true; but the alternatives proposed so far are worse.

regards, tom lane

#16Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#15)
Re: nested transactions

I should add that I am not prepared to overhaul the pg_clog file format
as part of adding subtransactions for 7.4. I can do the tid/sequential scan
method for abort, or the single-lock method described.

---------------------------------------------------------------------------

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Yes, locking is one possible solution, but no one likes that. One hack
lock idea would be to create a subtransaction-only lock, so if you see
the special 4-th xact state (about to be committed as part of a
subtransaction) you have to wait on that lock (held by the backend
twiddling the xact bits), then look again. That basically would
serialize all the bit-twiddling for subtransactions. I am sure I am
going to get a "yuck" from the audience on that one,

You sure are.

but I am not sure
how long that bit twiddling could take. Does xact twiddle every cause
I/O?

Yes, if the page of pg_clog you need to touch is not currently in a
buffer. With a large transaction you might have hundreds of
subtransactions, which could take an unpleasantly long time to mark
all committed.

What's worse, I think the above proposal requires a *single* lock for
this purpose (if there's more than one, how shall the requestor know
which one to block on?) --- so you are serializing all transaction
commits that have subtransactions, with only one able to go through at
a time. That will really, really not do; the performance will be way
worse than the chaining idea we discussed before.

You could store the backend slot id in pg_clog rather than the parent
xid and look up the status of the outer xid for that backend slot. That
would allow you to use 2 bytes, with a max of 16k backends.

This is also a bad idea, because backend slot ids are not stable (by the
time you look in PG_PROC, the slot may be occupied by a new, unrelated
backend process).

But still, you have an interesting idea of just setting the bit to be "I
am a child".

That bit alone doesn't help; you need to know *whose* child.

AFAICS, the objection to putting parent xact IDs into pg_clog is
basically a performance issue: bigger clog means more I/O. This is
surely true; but the alternatives proposed so far are worse.

regards, tom lane

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#16)
Re: nested transactions

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I should add that I am not prepared to overhaul the pg_clog file format
as part of adding subtransactions for 7.4. I can do the tid/sequential scan
method for abort, or the single-lock method described.

If you think that changing the pg_clog file format would be harder than
either of those other ideas, I think you're very badly mistaken.
pg_clog is touched only by one rather simple module.

I think the other methods will be completely unacceptable from a
performance point of view. They could maybe work if subtransactions
were a seldom-used feature; but the people who want to use 'em are
mostly talking about a subtransaction for *every* command. If you
design your implementation on the assumption that subtransactions are
infrequent, it will be unusably slow.

regards, tom lane

#18Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#17)
Re: nested transactions

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I should add that I am not prepared to overhaul the pg_clog file format
as part of adding subtransactions for 7.4. I can do the tid/sequential scan
method for abort, or the single-lock method described.

If you think that changing the pg_clog file format would be harder than
either of those other ideas, I think you're very badly mistaken.
pg_clog is touched only by one rather simple module.

Agreed, the clog changes would be the simple solution. However, I am
not sure I can make that level of changes. In fact, the complexity of
having multiple transactions per backend is going to be tough for me
too.

Also, I should point out that balooning pg_clog by 16x is going to mean
we are perhaps 4-8x more likely to need extra pages to mark all
subtransactions.

Isn't there some other way we can link these subtransactions together
rather than mucking with pg_clog, as we only need the linkage while we
mark them all committed?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#18)
Re: nested transactions

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Also, I should point out that balooning pg_clog by 16x is going to mean
we are perhaps 4-8x more likely to need extra pages to mark all
subtransactions.

So? The critical point is that we don't need to serialize the pg_clog
operations if we do it that way. Also, we can certainly expand the
number of pg_clog pages held in memory by some amount. Right now it's
only 4, IIRC. We could make it 64 and probably no one would even
notice.

Isn't there some other way we can link these subtransactions together
rather than mucking with pg_clog, as we only need the linkage while we
mark them all committed?

You *cannot* expect to do it all in shared memory; you will be blown out
of the water by the first long transaction that comes along, if you try.
So the question is not whether we put the status into a file, it is only
what representation we choose.

Manfred suggested a separate log file ("pg_subclog" or some such) but
I really don't see any operational advantage to that. You still end up
with 4 bytes per transaction, you're just assuming that putting them
in a different file makes it better. I don't see how.

regards, tom lane

#20Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#19)
Re: nested transactions

Tom Lane wrote:

Isn't there some other way we can link these subtransactions together
rather than mucking with pg_clog, as we only need the linkage while we
mark them all committed?

You *cannot* expect to do it all in shared memory; you will be blown out
of the water by the first long transaction that comes along, if you try.
So the question is not whether we put the status into a file, it is only
what representation we choose.

Manfred suggested a separate log file ("pg_subclog" or some such) but
I really don't see any operational advantage to that. You still end up
with 4 bytes per transaction, you're just assuming that putting them
in a different file makes it better. I don't see how.

It only becomes better if we can throw away that file (or contents) when
the transaction completes and we have marked all the subtransactions as
completed. We can't compress pg_clog if we store the parent info in
there.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#20)
#22Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#22)
#24Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#19)
#25Matthew T. O'Connor
matthew@zeut.net
In reply to: Tom Lane (#23)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#24)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Matthew T. O'Connor (#25)
#28Bruce Momjian
bruce@momjian.us
In reply to: Manfred Koizar (#11)
#29Manfred Koizar
mkoi-pg@aon.at
In reply to: Bruce Momjian (#13)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#29)
#31Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#30)
#32Bruce Momjian
bruce@momjian.us
In reply to: Manfred Koizar (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#31)
#34Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#33)
#35Manfred Koizar
mkoi-pg@aon.at
In reply to: Bruce Momjian (#32)
#36Bruce Momjian
bruce@momjian.us
In reply to: Manfred Koizar (#35)
#37Kevin Brown
kevin@sysexperts.com
In reply to: Bruce Momjian (#1)
#38Bruce Momjian
bruce@momjian.us
In reply to: Kevin Brown (#37)