Interesting tight loop
In production today (8.1.4), I ran into a backend process that
wouldn't cancel right away -- minutes went by.
It was in
[0]: TransactionIdIsCurrentTransactionId
[1]: HeapTupleSatisfiesSnapshot ...
...
But the interesting thing is that there were 4.6 million elements in
the s->childXids list. Which is why it took so damn long. I can't
quite figure out how I induced this state. It is an OLAP server with
about 10-20 connections that run "long" queries (from 5 seconds to 24
hours).
If this is a common possible state, it seems that perhaps a hash of
the childXids would be more appropriate. Does the order of the
childXids chained off the current transaction state matter? Most of
the places I could find that reference it seem to just attempt to
find an Xid in there. O(1) sounds a lot better than O(n) :-D
Best regards,
Theo
// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
Theo Schlossnagle <jesus@omniti.com> writes:
> But the interesting thing is that there were 4.6 million elements in the
> s->childXids list. Which is why it took so damn long. I can't quite figure
> out how I induced this state. It is an OLAP server with about 10-20
> connections that run "long" queries (from 5 seconds to 24 hours).
Wow, I think that means you've had 4.6 million subtransactions within the
current transaction. Is this a PL/PgSQL function or a script that uses
SAVEPOINT a lot?
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On Sep 14, 2006, at 7:03 AM, Gregory Stark wrote:
> Theo Schlossnagle <jesus@omniti.com> writes:
>> But the interesting thing is that there were 4.6 million elements in
>> the s->childXids list. Which is why it took so damn long. I can't
>> quite figure out how I induced this state. It is an OLAP server with
>> about 10-20 connections that run "long" queries (from 5 seconds to 24
>> hours).
>
> Wow, I think that means you've had 4.6 million subtransactions within
> the current transaction. Is this a PL/PgSQL function or a script that
> uses SAVEPOINT a lot?
We don't use savepoints much. Maybe one or two across our 1k or
so pl/pgsql procs.
We use dbi-link which is plperl. Perhaps that is somehow creating
subtransactions?
// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
Theo Schlossnagle <jesus@omniti.com> writes:
> We don't use savepoints much. Maybe one or two across our 1k or so
> pl/pgsql procs.
Well if they're in a loop...
> We use dbi-link which is plperl. Perhaps that is somehow creating
> subtransactions?
Ok, I more or less see what's going on. plperl creates a subtransaction
whenever you execute an SPI query from inside a perl function. That's so that
errors in the query can throw perl exceptions and be caught in the perl code.
So if your DBI source is an SPI connection (and not a connection to some other
database source) you will get a subtransaction for every remote_select() call.
In addition, dbi-link seems to do its work by creating a trigger which fires
once for every record you modify in its "shadow table". I'm not sure what
you're doing with those records but if you're sending them on via an SPI
connection to another table you'll get a subtransaction every time the trigger
fires.
It would be interesting to know which of these it is because in the former
case it may be something that could be fixed. We only really need to remember
subtransactions that have hit disk. But I rather suspect it's the latter case
since it's easy to see you firing a trigger 4.3M times.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
Ok, I more or less see what's going on. plperl creates a subtransaction
whenever you execute an SPI query from inside a perl function. That's so that
errors in the query can throw perl exceptions and be caught in the perl code.
It might also be worthwhile modifying plperl to check $^S which indicates
whether exceptions are going to be caught anywhere. If they're not going to be
caught would it be possible to skip creating the nested transaction and allow
them to force the whole transaction to be rolled back?
I'm not clear though whether there would be any internal perl interpreter
state that would get messed up if we just longjmp out from within perl code
rather than allow perl exceptions to get us out.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On Sep 14, 2006, at 8:19 AM, Gregory Stark wrote:
> Theo Schlossnagle <jesus@omniti.com> writes:
>> We don't use savepoints much. Maybe one or two across our 1k or so
>> pl/pgsql procs.
>
> Well if they're in a loop...
>
>> We use dbi-link which is plperl. Perhaps that is somehow creating
>> subtransactions?
>
> Ok, I more or less see what's going on. plperl creates a subtransaction
> whenever you execute an SPI query from inside a perl function. That's so
> that errors in the query can throw perl exceptions and be caught in the
> perl code. So if your DBI source is an SPI connection (and not a
> connection to some other database source) you will get a subtransaction
> for every remote_select() call.
>
> In addition, dbi-link seems to do its work by creating a trigger which
> fires once for every record you modify in its "shadow table". I'm not
> sure what you're doing with those records but if you're sending them on
> via an SPI connection to another table you'll get a subtransaction every
> time the trigger fires.
>
> It would be interesting to know which of these it is because in the
> former case it may be something that could be fixed. We only really need
> to remember subtransactions that have hit disk. But I rather suspect
> it's the latter case since it's easy to see you firing a trigger 4.3M
> times.
My remote_select() in DBI does a return_next($row). You think that
might be the problem? If that's the case -- that needs to be fixed.
The metalevel of the remote_select is (in plperl/DBI terms, roughly):

sub remote_select {
    my ($query) = @_;
    my $sth = $dbh->prepare($query);   # $dbh: the remote DBI handle
    $sth->execute();
    while (my $row = $sth->fetchrow_hashref()) {
        return_next($row);             # stream the row back to the caller
    }
    $sth->finish();
    return;
}
If that return_next is causing a subtransaction that would explain
my world of pain well.
// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
Theo Schlossnagle <jesus@omniti.com> writes:
> In production today (8.1.4), I ran into a backend process that
> wouldn't cancel right away -- minutes went by.
> It was in
> [0] TransactionIdIsCurrentTransactionId
> [1] HeapTupleSatisfiesSnapshot
> ...
> But the interesting thing is that there were 4.6 million elements in
> the s->childXids list. Which is why it took so damn long.
Well, I don't think TransactionIdIsCurrentTransactionId() itself is to
blame --- the loop over childXids is about as tight as such a loop could
possibly be (two indirect fetches and a comparison per iteration).
Even with 4.6M child Xids it could hardly take a second on any modern
machine. I'm not inclined to add a CHECK_FOR_INTERRUPTS there. The
problem should instead be blamed on something further down the call
stack ... did you save the full stack by any chance?
regards, tom lane