abort-time portal cleanup

Started by Robert Haasover 6 years ago10 messages

robertmhaas@gmail.com

over 6 years ago

1 attachment(s)

I've been studying At{Sub,}{Abort,Cleanup}_Portals() for the last few
days and have come to the conclusion that the code is not entirely up
to our usual standards. I believe that a good deal of the reason for
this is attributable to the poor quality of the code comments in this
area, although there are perhaps some other contributing factors as
well, such as bullheadedness on my part and perhaps others.

The trouble starts with the header comment for AtAbort_Portals, which
states that "At this point we run the cleanup hook if present, but we
can't release the portal's memory until the cleanup call." At the time
this logic was introduced in commit
de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02),
AtAbort_Portals() affected all non-held portals without caring whether
they were active or not, and, UserAbortTransactionBlock() called
AbortTransaction() directly, so typing "ROLLBACK;" would cause
AtAbort_Portals() to be reached from within PortalRun(). Even if
PortalRun() managed to return without crashing, the caller would next
try to call PortalDrop() on what was now an invalid pointer. However,
commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed
things so that UserAbortEndTransaction() just sets things up so that
the subsequent call to CommitTransactionCommand() would call
AbortTransaction() instead of trying to do it right away, and that
doesn't happen until after we're done with the portal. As far as I
can see, that change made this comment mostly false, but the comment
has nevertheless managed to survive for another ~15 years. I think we
can, and in fact should, just drop the portal right here.

As far as actually making that work, there are a few wrinkles. The
first is that we might be in the middle of FATAL. In that case, unlike
the ROLLBACK case, a call to PortalRun() is still on the stack, but
we'll exit the process rather than returning, so the fact that we've
created a dangling pointer for the caller won't matter. However, as
shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01)
and the report that led up to it at
/messages/by-id/20180128034531.h6o4w3727ifof3jy@alap3.anarazel.de
it's not a good idea to try to clean up the portal in that case,
because we might've already started shutting down critical systems.
It seems not only risky but also unnecessary: our process-local state
is about to go away, and the executor shouldn't need to clean up any
shared memory state that won't also get cleaned up by some other
mechanism. So, it seems to me that if we reach AtAbort_Portals()
during FATAL processing, we should either (1) do nothing at all and
just return or (2) forget about all the existing portals without
cleaning them up and then return. The second option seems a little
safer to me, because it guarantees that if we somehow reach code that
might try to look up a portal later, it won't find anything. But I
think it's arguable.

The second wrinkle is that there might be an active portal. Apart
from the FATAL case already mentioned, I think the only way this can
happen is some C code that calls purposefully calls AbortTransaction()
in the middle of executing a command. It can't be an ERROR, because
then the portal would be marked as failed before we get here, and it
can't be an explicit ROLLBACK, because as noted above, that case was
changed 15 years ago. It's got to be some other case where C code
calls AbortTransaction() voluntarily in the middle of a statement. For
over a decade, there were no cases at all of this type, but the code
in this function catered to hypothetical cases by marking the portal
failed. By 2016, Noah had figured out that this was bogus, and that
any future cases would likely require different handling, but Tom and
I shouted him down:

/messages/by-id/67674.1454259004@sss.pgh.pa.us

The end result of that discussion was commit
41baee7a9312eefb315b6b2973ac058c9efaa9cf (2016-02-05) which left the
code as it was but added comments nothing that it was wrong. It
actually wasn't entirely wrong, because it handled the FATAL case
mentioned above by the byzantine mechanism of invoking the portal's
cleanup callback after first setting the status to PORTAL_FAILED.
Since the only existing cleanup callback arranges to do nothing if the
status is PORTAL_FAILED, this worked out to a complicated way of
(correctly) skipping callback in the FATAL case.

But, probably because that wasn't documented in any understandable
way, possibly because nobody really understood it, when commit
8561e4840c81f7e345be2df170839846814fa004 (2018-01-22) added support
for transaction control in procedures, it just removed the code
marking active portals as failed, just as Noah had predicted would be
necessary ~2 years earlier. Andres noticed that this broke the FATAL
case and tracked it back to the removal of this code, resulting it it
getting put back, but just for the FATAL case, in commit
ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01). See also
discussion at:

/messages/by-id/20180128034531.h6o4w3727ifof3jy@alap3.anarazel.de

I think that the code here still isn't really right. Apart from the
fact that the comments don't explain anything very clearly and the
FATAL case is handled in a way that looks overly complicated and
accidental, the handling in AtSubAbort_Portals() hasn't been updated,
so it's now inconsistent with AtAbort_Portals() as to both substance
and comments. I think that's only held together because stored
procedures don't yet support explicit control over subtransactions,
only top-level transactions.

Stepping back a bit, stored procedures are a good example of a command
that uses multiple transactions internally. We have a few others, such
as VACUUM, but at present, that case only commits transactions
internally; it does not roll them back internally. If it did, it
would want the same thing that procedures want, namely, to leave the
active portal alone. It doesn't quite work to leave the active portal
completely alone, because the portal has got a pointer to a
ResourceOwner which is about to be freed by end-of-transaction
cleanup; at the least, we've got to clear the pointer to that when the
multi-transaction statement eventually finishes in some future
transaction, it doesn't try to do anything with a dangling pointer.
But note that the resource owner can't really be in use in this
situation anyway; if it is, then the transaction abort is about to
blow away resources that the statement still needs. Similarly, the
statement can't be using any transaction-local memory, because that
too is about to get blown away. The only thing that can work at all
here is a statement that's been carefully designed to be able to
survive starting and ending transactions internally. Such a statement
must not rely on any transaction-local resources. The only thing I'm
sure we have to do is set portal->resowner to NULL. Calling the
cleanup hook, as the current code does, can't be right, because we'd
be cleaning up something that isn't going away. I think it only works
now because this is another case where the cleanup hook arranges to do
nothing in the cases where calling it is wrong in the first place. The
current code also calls PortalReleaseCachedPlan in this case; I'm not
100% certain whether that's appropriate or not.

Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely
in favor of dropping portals on the spot in At(Sub)Abort_Portals(),
(2) overhauls the comments in this area, and (3) makes
AtSubAbort_Portals() consistent with AtAbort_Portals(). One possible
concern here - which Andres mentioned to me during off-list discussion
- is that someone might create a portal for a ROLLBACK statement, and
then try to execute that portal after an error has occurred. In
theory, keeping the portal around between abort time and cleanup time
might allow this to work, but it actually doesn't work, so there's no
functional regression. As noted by Amit Kapila also during off-list
discussion, IsTransactionStmtList() only works if portal->stmts is
set, and PortalReleaseCachedPlan() clears portal->stmts, so once we've
aborted the transaction, any previously-created portals are
unrecognizable as transaction statements.

This area is incredibly confusing to understand, so it's entirely
possible that I've gotten some things wrong. However, I think it's
worth the effort to try to do some cleanup here, because I think it's
overly complex and under-documented. On top of the commits already
mentioned, some of which I think demonstrate that the issues here
haven't been completely understood, I found other bug fix commits that
look like bugs that might've never happened in the first place if this
weren't all so confusing.

Feedback appreciated,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Improve-handling-of-portals-after-sub-transaction-ab.patchapplication/octet-stream; name=0001-Improve-handling-of-portals-after-sub-transaction-ab.patchDownload

From 2769f4b86b5e2f7d6044dafceaffe609f2ca47aa Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 12 Sep 2019 16:33:09 -0400
Subject: [PATCH] Improve handling of portals after (sub)transaction abort.

Remove At(Sub)Cleanup_Portals in favor of nuking portals directly
in At(Sub)Abort_Portals.  Fix comments that incorrect claim this
won't work, and improve other comments. Make AtSubAbort_Portals
more consistent with AtAbort_Portals.
---
 src/backend/access/transam/xact.c  |   3 -
 src/backend/utils/mmgr/portalmem.c | 279 ++++++++---------------------
 src/include/utils/portal.h         |   1 -
 3 files changed, 79 insertions(+), 204 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9162286c98..0b5baa8b0a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2756,7 +2756,6 @@ CleanupTransaction(void)
 	/*
 	 * do abort cleanup processing
 	 */
-	AtCleanup_Portals();		/* now safe to release portal memory */
 	AtEOXact_Snapshot(false, true); /* and release the transaction's snapshots */
 
 	CurrentResourceOwner = NULL;	/* and resource owner */
@@ -5032,8 +5031,6 @@ CleanupSubTransaction(void)
 		elog(WARNING, "CleanupSubTransaction while in %s state",
 			 TransStateAsString(s->state));
 
-	AtSubCleanup_Portals(s->subTransactionId);
-
 	CurrentResourceOwner = s->parent->curTransactionOwner;
 	CurTransactionResourceOwner = s->parent->curTransactionOwner;
 	if (s->curTransactionOwner)
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index 334e35bb6a..ec21636066 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -417,10 +417,6 @@ MarkPortalDone(Portal portal)
 	/*
 	 * Allow portalcmds.c to clean up the state it knows about.  We might as
 	 * well do that now, since the portal can't be executed any more.
-	 *
-	 * In some cases involving execution of a ROLLBACK command in an already
-	 * aborted transaction, this is necessary, or we'd reach AtCleanup_Portals
-	 * with the cleanup hook still unexecuted.
 	 */
 	if (PointerIsValid(portal->cleanup))
 	{
@@ -445,10 +441,6 @@ MarkPortalFailed(Portal portal)
 	/*
 	 * Allow portalcmds.c to clean up the state it knows about.  We might as
 	 * well do that now, since the portal can't be executed any more.
-	 *
-	 * In some cases involving cleanup of an already aborted transaction, this
-	 * is necessary, or we'd reach AtCleanup_Portals with the cleanup hook
-	 * still unexecuted.
 	 */
 	if (PointerIsValid(portal->cleanup))
 	{
@@ -765,8 +757,11 @@ PreCommit_Portals(bool isPrepare)
 /*
  * Abort processing for portals.
  *
- * At this point we run the cleanup hook if present, but we can't release the
- * portal's memory until the cleanup call.
+ * Most portals don't and shouldn't survive transaction abort, but there are
+ * some important special cases where they do and must: (1) held portals must
+ * survive by definition, and (2) any active portal must be part of a command
+ * that uses multiple transactions internally, and needs to survive until
+ * execution of that command has completed.
  */
 void
 AtAbort_Portals(void)
@@ -781,14 +776,22 @@ AtAbort_Portals(void)
 		Portal		portal = hentry->portal;
 
 		/*
-		 * When elog(FATAL) is progress, we need to set the active portal to
-		 * failed, so that PortalCleanup() doesn't run the executor shutdown.
+		 * If we're exiting due to a FATAL error, just blow away all portals
+		 * without really doing any cleanup.  Because the process is going
+		 * to exit anyway, it shouldn't really be necessary to do any cleanup.
+		 * It might also not be safe, if other parts of the system have
+		 * already been shut down. This also makes sure that no future
+		 * references to existing portals are possible.
 		 */
-		if (portal->status == PORTAL_ACTIVE && shmem_exit_inprogress)
-			MarkPortalFailed(portal);
+		if (shmem_exit_inprogress)
+		{
+			PortalHashTableDelete(portal);
+			continue;
+		}
 
 		/*
-		 * Do nothing else to cursors held over from a previous transaction.
+		 * Otherwise, do nothing to cursors held over from a previous
+		 * transaction.
 		 */
 		if (portal->createSubid == InvalidSubTransactionId)
 			continue;
@@ -802,98 +805,40 @@ AtAbort_Portals(void)
 			continue;
 
 		/*
-		 * If it was created in the current transaction, we can't do normal
-		 * shutdown on a READY portal either; it might refer to objects
-		 * created in the failed transaction.  See comments in
-		 * AtSubAbort_Portals.
-		 */
-		if (portal->status == PORTAL_READY)
-			MarkPortalFailed(portal);
-
-		/*
-		 * Allow portalcmds.c to clean up the state it knows about, if we
-		 * haven't already.
-		 */
-		if (PointerIsValid(portal->cleanup))
-		{
-			portal->cleanup(portal);
-			portal->cleanup = NULL;
-		}
-
-		/* drop cached plan reference, if any */
-		PortalReleaseCachedPlan(portal);
-
-		/*
-		 * Any resources belonging to the portal will be released in the
-		 * upcoming transaction-wide cleanup; they will be gone before we run
-		 * PortalDrop.
-		 */
-		portal->resowner = NULL;
-
-		/*
-		 * Although we can't delete the portal data structure proper, we can
-		 * release any memory in subsidiary contexts, such as executor state.
-		 * The cleanup hook was the last thing that might have needed data
-		 * there.  But leave active portals alone.
-		 */
-		if (portal->status != PORTAL_ACTIVE)
-			MemoryContextDeleteChildren(portal->portalContext);
-	}
-}
-
-/*
- * Post-abort cleanup for portals.
- *
- * Delete all portals not held over from prior transactions.  */
-void
-AtCleanup_Portals(void)
-{
-	HASH_SEQ_STATUS status;
-	PortalHashEnt *hentry;
-
-	hash_seq_init(&status, PortalHashTable);
-
-	while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
-	{
-		Portal		portal = hentry->portal;
-
-		/*
-		 * Do not touch active portals --- this can only happen in the case of
-		 * a multi-transaction command.
+		 * If the status is PORTAL_ACTIVE, then we must be executing a command
+		 * that uses multiple transactions internally. In that case, the
+		 * command in question must be one that does not internally rely on
+		 * any transaction-lifetime resources, because they would disappear
+		 * in the upcoming transaction-wide cleanup.
 		 */
 		if (portal->status == PORTAL_ACTIVE)
-			continue;
-
-		/*
-		 * Do nothing to cursors held over from a previous transaction or
-		 * auto-held ones.
-		 */
-		if (portal->createSubid == InvalidSubTransactionId || portal->autoHeld)
 		{
-			Assert(portal->status != PORTAL_ACTIVE);
-			Assert(portal->resowner == NULL);
+			/*
+			 * The resource owner is about to be destroyed, so we must not
+			 * retain a pointer to it.
+			 */
+			portal->resowner = NULL;
 			continue;
 		}
 
 		/*
+		 * Drop the portal.
+		 *
+		 * We used to postpone the actual removal of the portal until
+		 * CleanupTransaction() time, but there's not much point, because
+		 * until then the transaction is in a failed state and existing portals
+		 * can't be used anyway.
+		 *
+		 * (Maybe it would be a good idea to rearrange things so that a
+		 * transaction statement whose portal was created before the abort
+		 * could still be used afterwards to roll back the transaction, but
+		 * historically we haven't cared to support that case.)
+		 *
 		 * If a portal is still pinned, forcibly unpin it. PortalDrop will not
 		 * let us drop the portal otherwise. Whoever pinned the portal was
 		 * interrupted by the abort too and won't try to use it anymore.
 		 */
-		if (portal->portalPinned)
-			portal->portalPinned = false;
-
-		/*
-		 * We had better not call any user-defined code during cleanup, so if
-		 * the cleanup hook hasn't been run yet, too bad; we'll just skip it.
-		 */
-		if (PointerIsValid(portal->cleanup))
-		{
-			elog(WARNING, "skipping cleanup for portal \"%s\"", portal->name);
-			portal->cleanup = NULL;
-		}
-
-		/* Zap it. */
+		portal->portalPinned = false;
 		PortalDrop(portal, false);
 	}
 }
@@ -958,11 +903,9 @@ AtSubCommit_Portals(SubTransactionId mySubid,
 /*
  * Subtransaction abort handling for portals.
  *
- * Deactivate portals created or used during the failed subtransaction.
+ * Destroy portals created or used during the failed subtransaction.
  * Note that per AtSubCommit_Portals, this will catch portals created/used
  * in descendants of the subtransaction too.
- *
- * We don't destroy any portals here; that's done in AtSubCleanup_Portals.
  */
 void
 AtSubAbort_Portals(SubTransactionId mySubid,
@@ -979,6 +922,40 @@ AtSubAbort_Portals(SubTransactionId mySubid,
 	{
 		Portal		portal = hentry->portal;
 
+		/*
+		 * If we're exiting due to a FATAL error, just blow away all portals,
+		 * same as in AtAbort_Portals.
+		 */
+		if (shmem_exit_inprogress)
+		{
+			PortalHashTableDelete(portal);
+			continue;
+		}
+
+		/*
+		 * If we find an active portal, it means that the calling code is
+		 * trying to roll back a subtransaction in mid-command. Note that we
+		 * wouldn't reach this case on ERROR, because an ERROR while running
+		 * a portal puts it into a failed state, nor would we reach it from
+		 * executing a normal ROLLBACK TO SAVEPOINT command, which arranges
+		 * to do the real work after the portal is no longer active.  But
+		 * C code that manipulates transactions explicitly could cause us
+		 * to reach this.
+		 *
+		 * We would have big problems here if this portal depends on any
+		 * resources proper to the current subtransaction, but let's leave
+		 * that problem to the (hypothetical) caller.
+		 */
+		if (portal->status == PORTAL_ACTIVE)
+		{
+			/*
+			 * This would be nonsensical if the portal had been created in
+			 * the current subtransaction.
+			 */
+			Assert(portal->createSubid != mySubid);
+			continue;
+		}
+
 		/* Was it created in this subtransaction? */
 		if (portal->createSubid != mySubid)
 		{
@@ -988,25 +965,6 @@ AtSubAbort_Portals(SubTransactionId mySubid,
 				/* Maintain activeSubid until the portal is removed */
 				portal->activeSubid = parentSubid;
 
-				/*
-				 * A MarkPortalActive() caller ran an upper-level portal in
-				 * this subtransaction and left the portal ACTIVE.  This can't
-				 * happen, but force the portal into FAILED state for the same
-				 * reasons discussed below.
-				 *
-				 * We assume we can get away without forcing upper-level READY
-				 * portals to fail, even if they were run and then suspended.
-				 * In theory a suspended upper-level portal could have
-				 * acquired some references to objects that are about to be
-				 * destroyed, but there should be sufficient defenses against
-				 * such cases: the portal's original query cannot contain such
-				 * references, and any references within, say, cached plans of
-				 * PL/pgSQL functions are not from active queries and should
-				 * be protected by revalidation logic.
-				 */
-				if (portal->status == PORTAL_ACTIVE)
-					MarkPortalFailed(portal);
-
 				/*
 				 * Also, if we failed it during the current subtransaction
 				 * (either just above, or earlier), reattach its resource
@@ -1028,89 +986,10 @@ AtSubAbort_Portals(SubTransactionId mySubid,
 		}
 
 		/*
-		 * Force any live portals of my own subtransaction into FAILED state.
-		 * We have to do this because they might refer to objects created or
-		 * changed in the failed subtransaction, leading to crashes within
-		 * ExecutorEnd when portalcmds.c tries to close down the portal.
-		 * Currently, every MarkPortalActive() caller ensures it updates the
-		 * portal status again before relinquishing control, so ACTIVE can't
-		 * happen here.  If it does happen, dispose the portal like existing
-		 * MarkPortalActive() callers would.
+		 * Portals created in this subtransaction should be dropped, even
+		 * if pinned. See comments in AtAbort_Portals for more details.
 		 */
-		if (portal->status == PORTAL_READY ||
-			portal->status == PORTAL_ACTIVE)
-			MarkPortalFailed(portal);
-
-		/*
-		 * Allow portalcmds.c to clean up the state it knows about, if we
-		 * haven't already.
-		 */
-		if (PointerIsValid(portal->cleanup))
-		{
-			portal->cleanup(portal);
-			portal->cleanup = NULL;
-		}
-
-		/* drop cached plan reference, if any */
-		PortalReleaseCachedPlan(portal);
-
-		/*
-		 * Any resources belonging to the portal will be released in the
-		 * upcoming transaction-wide cleanup; they will be gone before we run
-		 * PortalDrop.
-		 */
-		portal->resowner = NULL;
-
-		/*
-		 * Although we can't delete the portal data structure proper, we can
-		 * release any memory in subsidiary contexts, such as executor state.
-		 * The cleanup hook was the last thing that might have needed data
-		 * there.
-		 */
-		MemoryContextDeleteChildren(portal->portalContext);
-	}
-}
-
-/*
- * Post-subabort cleanup for portals.
- *
- * Drop all portals created in the failed subtransaction (but note that
- * we will not drop any that were reassigned to the parent above).
- */
-void
-AtSubCleanup_Portals(SubTransactionId mySubid)
-{
-	HASH_SEQ_STATUS status;
-	PortalHashEnt *hentry;
-
-	hash_seq_init(&status, PortalHashTable);
-
-	while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
-	{
-		Portal		portal = hentry->portal;
-
-		if (portal->createSubid != mySubid)
-			continue;
-
-		/*
-		 * If a portal is still pinned, forcibly unpin it. PortalDrop will not
-		 * let us drop the portal otherwise. Whoever pinned the portal was
-		 * interrupted by the abort too and won't try to use it anymore.
-		 */
-		if (portal->portalPinned)
-			portal->portalPinned = false;
-
-		/*
-		 * We had better not call any user-defined code during cleanup, so if
-		 * the cleanup hook hasn't been run yet, too bad; we'll just skip it.
-		 */
-		if (PointerIsValid(portal->cleanup))
-		{
-			elog(WARNING, "skipping cleanup for portal \"%s\"", portal->name);
-			portal->cleanup = NULL;
-		}
-
-		/* Zap it. */
+		portal->portalPinned = false;
 		PortalDrop(portal, false);
 	}
 }
diff --git a/src/include/utils/portal.h b/src/include/utils/portal.h
index 098c837e0a..918041d672 100644
--- a/src/include/utils/portal.h
+++ b/src/include/utils/portal.h
@@ -214,7 +214,6 @@ extern void AtSubAbort_Portals(SubTransactionId mySubid,
 							   SubTransactionId parentSubid,
 							   ResourceOwner myXactOwner,
 							   ResourceOwner parentXactOwner);
-extern void AtSubCleanup_Portals(SubTransactionId mySubid);
 extern Portal CreatePortal(const char *name, bool allowDup, bool dupSilent);
 extern Portal CreateNewPortal(void);
 extern void PinPortal(Portal portal);
-- 
2.17.2 (Apple Git-113)

Amit Kapila

amit.kapila16@gmail.com

over 6 years ago

In reply to: Robert Haas (#1)

Re: abort-time portal cleanup

On Fri, Sep 13, 2019 at 2:13 AM Robert Haas <robertmhaas@gmail.com> wrote:

I've been studying At{Sub,}{Abort,Cleanup}_Portals() for the last few
days and have come to the conclusion that the code is not entirely up
to our usual standards. I believe that a good deal of the reason for
this is attributable to the poor quality of the code comments in this
area, although there are perhaps some other contributing factors as
well, such as bullheadedness on my part and perhaps others.

The trouble starts with the header comment for AtAbort_Portals, which
states that "At this point we run the cleanup hook if present, but we
can't release the portal's memory until the cleanup call." At the time
this logic was introduced in commit
de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02),
AtAbort_Portals() affected all non-held portals without caring whether
they were active or not, and, UserAbortTransactionBlock() called
AbortTransaction() directly, so typing "ROLLBACK;" would cause
AtAbort_Portals() to be reached from within PortalRun(). Even if
PortalRun() managed to return without crashing, the caller would next
try to call PortalDrop() on what was now an invalid pointer. However,
commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed
things so that UserAbortEndTransaction() just sets things up so that
the subsequent call to CommitTransactionCommand() would call
AbortTransaction() instead of trying to do it right away, and that
doesn't happen until after we're done with the portal. As far as I
can see, that change made this comment mostly false, but the comment
has nevertheless managed to survive for another ~15 years. I think we
can, and in fact should, just drop the portal right here.

As far as actually making that work, there are a few wrinkles. The
first is that we might be in the middle of FATAL. In that case, unlike
the ROLLBACK case, a call to PortalRun() is still on the stack, but
we'll exit the process rather than returning, so the fact that we've
created a dangling pointer for the caller won't matter. However, as
shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01)
and the report that led up to it at
/messages/by-id/20180128034531.h6o4w3727ifof3jy@alap3.anarazel.de
it's not a good idea to try to clean up the portal in that case,
because we might've already started shutting down critical systems.
It seems not only risky but also unnecessary: our process-local state
is about to go away, and the executor shouldn't need to clean up any
shared memory state that won't also get cleaned up by some other
mechanism. So, it seems to me that if we reach AtAbort_Portals()
during FATAL processing, we should either (1) do nothing at all and
just return or (2) forget about all the existing portals without
cleaning them up and then return. The second option seems a little
safer to me, because it guarantees that if we somehow reach code that
might try to look up a portal later, it won't find anything. But I
think it's arguable.

I agree with your position on this.

Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely
in favor of dropping portals on the spot in At(Sub)Abort_Portals(),
(2) overhauls the comments in this area, and (3) makes
AtSubAbort_Portals() consistent with AtAbort_Portals().

The overall idea seems good to me, but I have a few comments on the changes.

1.
@@ -2756,7 +2756,6 @@ CleanupTransaction(void)
  /*
  * do abort cleanup processing
  */
- AtCleanup_Portals(); /* now safe to release portal
memory */
  AtEOXact_Snapshot(false, true); /* and release the transaction's
snapshots */

CurrentResourceOwner = NULL; /* and resource owner */
@@ -5032,8 +5031,6 @@ CleanupSubTransaction(void)
elog(WARNING, "CleanupSubTransaction while in %s state",
TransStateAsString(s->state));

- AtSubCleanup_Portals(s->subTransactionId);
-

After this cleanup, I think we don't need At(Sub)Abort_Portals in
AbortOutOfAnyTransaction() for the states TBLOCK_(SUB)ABORT and
friends. This is because AbortTransaction itself would have zapped the
portal.

2. You seem to forgot removing AtCleanup_Portals() from portal.h

3.
/*
- * If it was created in the current transaction, we
can't do normal
- * shutdown on a READY portal either; it might refer to
objects
- * created in the failed transaction. See comments in
- * AtSubAbort_Portals.
- */
- if (portal->status == PORTAL_READY)
- MarkPortalFailed(portal);
-

Why it is safe to remove this check? It has been explained in commit
7981c342 why we need that check. I don't see any explanation in email
or patch which justifies this code removal. Is it because you removed
PortalCleanup? If so, that is still called from PortalDrop?

4.
-AtCleanup_Portals(void)
-{
- HASH_SEQ_STATUS status;
- PortalHashEnt *hentry;
-
- hash_seq_init(&status, PortalHashTable);
-
- while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) !=
NULL)
- {
- Portal portal = hentry->portal;
-
- /*
- * Do not touch active portals --- this can only happen
in the case of
- * a multi-transaction command.
+ * If the status is PORTAL_ACTIVE, then we must be
executing a command
+ * that uses multiple transactions internally. In that
case, the
+ * command in question must be one that does not
internally rely on
+ * any transaction-lifetime resources, because they
would disappear
+ * in the upcoming transaction-wide cleanup.
  */
  if (portal->status == PORTAL_ACTIVE)

I am not able to understand how we can reach with the portal state as
'active' for a multi-transaction command. It seems wherever we mark
portal as active, we don't relinquish the control unless its state is
changed. Can you share some example where this can happen?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Dilip Kumar

dilipbalaut@gmail.com

over 6 years ago

In reply to: Robert Haas (#1)

Re: abort-time portal cleanup

On Fri, Sep 13, 2019 at 2:13 AM Robert Haas <robertmhaas@gmail.com> wrote:

/*
* Otherwise, do nothing to cursors held over from a previous
* transaction.
*/
if (portal->createSubid == InvalidSubTransactionId)
continue;

/*
* Do nothing to auto-held cursors. This is similar to the case of a
* cursor from a previous transaction, but it could also be that the
* cursor was auto-held in this transaction, so it wants to live on.
*/
if (portal->autoHeld)
continue;

I have one doubt that why do we need the second check. Because before
setting portal->autoHeld to true we always call HoldPortal therein we
set portal->createSubid to InvalidSubTransactionId. So it seems to me
that the second condition will never reach. Am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Andres Freund

andres@anarazel.de

over 6 years ago

In reply to: Robert Haas (#1)

Re: abort-time portal cleanup

Hi,

On 2019-09-12 16:42:46 -0400, Robert Haas wrote:

The trouble starts with the header comment for AtAbort_Portals, which
states that "At this point we run the cleanup hook if present, but we
can't release the portal's memory until the cleanup call." At the time
this logic was introduced in commit
de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02),
AtAbort_Portals() affected all non-held portals without caring whether
they were active or not, and, UserAbortTransactionBlock() called
AbortTransaction() directly, so typing "ROLLBACK;" would cause
AtAbort_Portals() to be reached from within PortalRun(). Even if
PortalRun() managed to return without crashing, the caller would next
try to call PortalDrop() on what was now an invalid pointer. However,
commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed
things so that UserAbortEndTransaction() just sets things up so that
the subsequent call to CommitTransactionCommand() would call
AbortTransaction() instead of trying to do it right away, and that
doesn't happen until after we're done with the portal. As far as I
can see, that change made this comment mostly false, but the comment
has nevertheless managed to survive for another ~15 years. I think we
can, and in fact should, just drop the portal right here.

Nice digging.

As far as actually making that work, there are a few wrinkles. The
first is that we might be in the middle of FATAL. In that case, unlike
the ROLLBACK case, a call to PortalRun() is still on the stack, but
we'll exit the process rather than returning, so the fact that we've
created a dangling pointer for the caller won't matter. However, as
shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01)
and the report that led up to it at
/messages/by-id/20180128034531.h6o4w3727ifof3jy@alap3.anarazel.de
it's not a good idea to try to clean up the portal in that case,
because we might've already started shutting down critical systems.
It seems not only risky but also unnecessary: our process-local state
is about to go away, and the executor shouldn't need to clean up any
shared memory state that won't also get cleaned up by some other
mechanism. So, it seems to me that if we reach AtAbort_Portals()
during FATAL processing, we should either (1) do nothing at all and
just return or (2) forget about all the existing portals without
cleaning them up and then return. The second option seems a little
safer to me, because it guarantees that if we somehow reach code that
might try to look up a portal later, it won't find anything. But I
think it's arguable.

Hm. Doing that cleanup requires digging through all the portals etc. I'd
rather rely on less state being correct than more during FATAL
processing.

The second wrinkle is that there might be an active portal. Apart
from the FATAL case already mentioned, I think the only way this can
happen is some C code that calls purposefully calls AbortTransaction()
in the middle of executing a command. It can't be an ERROR, because
then the portal would be marked as failed before we get here, and it
can't be an explicit ROLLBACK, because as noted above, that case was
changed 15 years ago. It's got to be some other case where C code
calls AbortTransaction() voluntarily in the middle of a statement. For
over a decade, there were no cases at all of this type, but the code
in this function catered to hypothetical cases by marking the portal
failed. By 2016, Noah had figured out that this was bogus, and that
any future cases would likely require different handling, but Tom and
I shouted him down:

Hm. But wouldn't doing so run into a ton of problems anyway? I mean we
need to do a ton of checks and special case hangups to make
CommitTransactionCommand();StartTransactionCommand(); work for VACUUM,
CIC, ...

The cases where one can use AbortTransaction() (via
AbortCurrentTransaction() presumably) are ones where either there's no
surrounding code relying on the transaction (e.g. autovacuum,
postgres.c), or where special care has been taken with portals
(e.g. _SPI_rollback()). We didn't have the pin mechanism back then, so
I think even if we accept your/Tom's reasoning from back then (I don't
really), it's outdated now that the pin mechanism exists.

I'd be happy if we added some defenses against such bogus cases being
introduced (i.e. erroring out if we encounter an active portal during
abort processing).

Stepping back a bit, stored procedures are a good example of a command
that uses multiple transactions internally. We have a few others, such
as VACUUM, but at present, that case only commits transactions
internally; it does not roll them back internally. If it did, it
would want the same thing that procedures want, namely, to leave the
active portal alone. It doesn't quite work to leave the active portal
completely alone, because the portal has got a pointer to a
ResourceOwner which is about to be freed by end-of-transaction
cleanup;

Well, that's why _SPI_commit()/_SPI_rollback() do a HoldPinnedPortals(),
which, via HoldPortal(), sets portal->resowner to = NULL.

The current code also calls PortalReleaseCachedPlan in this case; I'm
not 100% certain whether that's appropriate or not.

I think it's appropriate, because we cannot guarantee that the plan is
still usable. Besides normal plan invalidation issues, the catalog
contents the plan might rely on might have only existed in the aborted
transaction - which seems like a fatal problem. That's why holding
portals persists the portalstore. Which then also obsoletes the plan.

Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely
in favor of dropping portals on the spot in At(Sub)Abort_Portals(),
(2) overhauls the comments in this area, and (3) makes
AtSubAbort_Portals() consistent with AtAbort_Portals(). One possible
concern here - which Andres mentioned to me during off-list discussion
- is that someone might create a portal for a ROLLBACK statement, and
then try to execute that portal after an error has occurred.

Besides not quite happening as I though, as you reference below, I think
that only mattered if a rollback gets executed in a failed transaction -
that has to happen after a sync message, because otherwise we'd just
skip the command. But to be problematic, the bind would have to be from
before the sync, and the exec from after - which'd be an extremely
absurd use of the protocol.

/*
* Abort processing for portals.
*
- * At this point we run the cleanup hook if present, but we can't release the
- * portal's memory until the cleanup call.
+ * Most portals don't and shouldn't survive transaction abort, but there are
+ * some important special cases where they do and must: (1) held portals must
+ * survive by definition, and (2) any active portal must be part of a command
+ * that uses multiple transactions internally, and needs to survive until
+ * execution of that command has completed.

Hm. Why are active, rather than pinnned, portals relevant here? Normal
multi-transactional commands (e.g. vacuum, CIC) shouldn't get here with
an active portal, as they don't catch errors. And procedures should have
pinned the relevant portals?

Greetings,

Andres Freund

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Andres Freund (#4)

Re: abort-time portal cleanup

On Tue, Sep 24, 2019 at 4:45 PM Andres Freund <andres@anarazel.de> wrote:

Hm. Doing that cleanup requires digging through all the portals etc. I'd
rather rely on less state being correct than more during FATAL
processing.

I agree with the principle, but the advantage of removing the hash
table entries is that it protects against any other code that might
try to access the portal machinery later; I thought that was
worthwhile enough to justify doing this here. I don't feel
super-strongly about it, but I do still like that approach.

The second wrinkle is that there might be an active portal. Apart
from the FATAL case already mentioned, I think the only way this can
happen is some C code that calls purposefully calls AbortTransaction()
in the middle of executing a command. It can't be an ERROR, because
then the portal would be marked as failed before we get here, and it
can't be an explicit ROLLBACK, because as noted above, that case was
changed 15 years ago. It's got to be some other case where C code
calls AbortTransaction() voluntarily in the middle of a statement. For
over a decade, there were no cases at all of this type, but the code
in this function catered to hypothetical cases by marking the portal
failed. By 2016, Noah had figured out that this was bogus, and that
any future cases would likely require different handling, but Tom and
I shouted him down:

Hm. But wouldn't doing so run into a ton of problems anyway? I mean we
need to do a ton of checks and special case hangups to make
CommitTransactionCommand();StartTransactionCommand(); work for VACUUM,
CIC, ...

Right... those are the kinds of cases I'm talking about here, just for
abort rather than commit.

The cases where one can use AbortTransaction() (via
AbortCurrentTransaction() presumably) are ones where either there's no
surrounding code relying on the transaction (e.g. autovacuum,
postgres.c), or where special care has been taken with portals
(e.g. _SPI_rollback()). We didn't have the pin mechanism back then, so
I think even if we accept your/Tom's reasoning from back then (I don't
really), it's outdated now that the pin mechanism exists.

It isn't, actually. To respond to this and also your question below
about why I'm looking at active portal rather than pinned portals, try
adding this debugging code to AtAbort_Portals:

+        if (portal->status == PORTAL_ACTIVE)
+            elog(NOTICE, "this portal is ACTIVE and %spinned",
+                 portal->portalPinned ? "" : "NOT ");

Then run 'make -C src/pl/plpgsql check' and check
src/pl/plpgsql/src/regression.diffs and you'll see a whole lot of
this:

+NOTICE: this portal is ACTIVE and NOT pinned

The PLs pin the portals they generate internally, but they don't force
the surrounding portal in which the toplevel query is executing to be
pinned. AFAICT, pinning is mostly guarding against explicit
user-initiated drops of portals that were automatically generated by a
PL, whereas the portal's state is about tracking what the system is
doing with the portal.

(I think this could be a lot better documented than it is, but looking
at the commit history, I'm fairly sure that's what is happening here.)

I'd be happy if we added some defenses against such bogus cases being
introduced (i.e. erroring out if we encounter an active portal during
abort processing).

Erroring out during error handling is probably a bad idea, but also, see above.

Stepping back a bit, stored procedures are a good example of a command
that uses multiple transactions internally. We have a few others, such
as VACUUM, but at present, that case only commits transactions
internally; it does not roll them back internally. If it did, it
would want the same thing that procedures want, namely, to leave the
active portal alone. It doesn't quite work to leave the active portal
completely alone, because the portal has got a pointer to a
ResourceOwner which is about to be freed by end-of-transaction
cleanup;

Well, that's why _SPI_commit()/_SPI_rollback() do a HoldPinnedPortals(),
which, via HoldPortal(), sets portal->resowner to = NULL.

Right, but it's still necessary for AtAbort_Portals() to do the same
thing, again because the top-level portal isn't pinned. If you apply
my patch, comment out the line that does portal->resowner = NULL; for
an active portal, and run make -C src/pl/plpgsql check, it will seg
fault inside exec_simple_query -> PortalDrop -> ResourceOwnerRelease
-> etc.

The current code also calls PortalReleaseCachedPlan in this case; I'm
not 100% certain whether that's appropriate or not.

I think it's appropriate, because we cannot guarantee that the plan is
still usable. Besides normal plan invalidation issues, the catalog
contents the plan might rely on might have only existed in the aborted
transaction - which seems like a fatal problem. That's why holding
portals persists the portalstore. Which then also obsoletes the plan.

In a case like this, it can't really be an actual planable statement.
If the executor were involved, aborting the running transaction and
starting a new one would certainly result in a crash, because the
memory context in which we had saved all of our working state would be
vanish out from under us. It's got to be a procedure call or maybe
some kind of DDL command that has been specially-crafted to survive a
mid-command abort. It's not clear to me that the issues for plan
invalidation and catalog contents are the same in those kinds of cases
as they are for planable statements, but perhaps they are.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Amit Kapila (#2)

1 attachment(s)

Re: abort-time portal cleanup

On Tue, Sep 24, 2019 at 5:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

After this cleanup, I think we don't need At(Sub)Abort_Portals in
AbortOutOfAnyTransaction() for the states TBLOCK_(SUB)ABORT and
friends. This is because AbortTransaction itself would have zapped the
portal.

Not if the ROLLBACK itself failed - in that case, the portal would
have been active at the time, and thus not subject to removal. And, as
the existing comments in xact.c state, that's exactly why that
function call is there.

2. You seem to forgot removing AtCleanup_Portals() from portal.h

Oops. Fixed in the attached version.

- if (portal->status == PORTAL_READY)
- MarkPortalFailed(portal);

Why it is safe to remove this check? It has been explained in commit
7981c342 why we need that check. I don't see any explanation in email
or patch which justifies this code removal. Is it because you removed
PortalCleanup? If so, that is still called from PortalDrop?

All MarkPortalFailed() does is change the status to PORTAL_FAILED and
call the cleanup hook. PortalDrop() calls the cleanup hook, and we
don't need to change the status if we're removing it completely.

4.
+ * If the status is PORTAL_ACTIVE, then we must be
executing a command
+ * that uses multiple transactions internally. In that
case, the
+ * command in question must be one that does not
internally rely on
+ * any transaction-lifetime resources, because they
would disappear
+ * in the upcoming transaction-wide cleanup.
*/
if (portal->status == PORTAL_ACTIVE)
I am not able to understand how we can reach with the portal state as
'active' for a multi-transaction command. It seems wherever we mark
portal as active, we don't relinquish the control unless its state is
changed. Can you share some example where this can happen?

Yeah -- a plpgsql function or procedure that does "ROLLBACK;"
internally. The calling code doesn't relinquish control, but it does
reach AbortTransaction().

If you want to see it happen, just put an elog() inside that block and
run make -C src/pl/plpgsql check.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v2-0001-Improve-handling-of-portals-after-sub-transaction.patchapplication/octet-stream; name=v2-0001-Improve-handling-of-portals-after-sub-transaction.patchDownload

From 925ea608bb42e1d5b040991ed1c4cb121398f917 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 7 Oct 2019 12:12:57 -0400
Subject: [PATCH v2] Improve handling of portals after (sub)transaction abort.

Remove At(Sub)Cleanup_Portals in favor of nuking portals directly
in At(Sub)Abort_Portals.  Fix comments that incorrect claim this
won't work, and improve other comments. Make AtSubAbort_Portals
more consistent with AtAbort_Portals.

Patch by me, reviewed by Andres Freund, Amit Kapila, and Dilip Kumar.

Discussion: http://postgr.es/m/CA+TgmobqtKQrWuJLps61N62wsV1GsW1sjvnXf4-C7qXi43=+OQ@mail.gmail.com
---
 src/backend/access/transam/xact.c  |   3 -
 src/backend/utils/mmgr/portalmem.c | 279 ++++++++---------------------
 src/include/utils/portal.h         |   2 -
 3 files changed, 79 insertions(+), 205 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index fc55fa6d53..9a468f7ba5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2756,7 +2756,6 @@ CleanupTransaction(void)
 	/*
 	 * do abort cleanup processing
 	 */
-	AtCleanup_Portals();		/* now safe to release portal memory */
 	AtEOXact_Snapshot(false, true); /* and release the transaction's snapshots */
 
 	CurrentResourceOwner = NULL;	/* and resource owner */
@@ -5031,8 +5030,6 @@ CleanupSubTransaction(void)
 		elog(WARNING, "CleanupSubTransaction while in %s state",
 			 TransStateAsString(s->state));
 
-	AtSubCleanup_Portals(s->subTransactionId);
-
 	CurrentResourceOwner = s->parent->curTransactionOwner;
 	CurTransactionResourceOwner = s->parent->curTransactionOwner;
 	if (s->curTransactionOwner)
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index 334e35bb6a..ec21636066 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -417,10 +417,6 @@ MarkPortalDone(Portal portal)
 	/*
 	 * Allow portalcmds.c to clean up the state it knows about.  We might as
 	 * well do that now, since the portal can't be executed any more.
-	 *
-	 * In some cases involving execution of a ROLLBACK command in an already
-	 * aborted transaction, this is necessary, or we'd reach AtCleanup_Portals
-	 * with the cleanup hook still unexecuted.
 	 */
 	if (PointerIsValid(portal->cleanup))
 	{
@@ -445,10 +441,6 @@ MarkPortalFailed(Portal portal)
 	/*
 	 * Allow portalcmds.c to clean up the state it knows about.  We might as
 	 * well do that now, since the portal can't be executed any more.
-	 *
-	 * In some cases involving cleanup of an already aborted transaction, this
-	 * is necessary, or we'd reach AtCleanup_Portals with the cleanup hook
-	 * still unexecuted.
 	 */
 	if (PointerIsValid(portal->cleanup))
 	{
@@ -765,8 +757,11 @@ PreCommit_Portals(bool isPrepare)
 /*
  * Abort processing for portals.
  *
- * At this point we run the cleanup hook if present, but we can't release the
- * portal's memory until the cleanup call.
+ * Most portals don't and shouldn't survive transaction abort, but there are
+ * some important special cases where they do and must: (1) held portals must
+ * survive by definition, and (2) any active portal must be part of a command
+ * that uses multiple transactions internally, and needs to survive until
+ * execution of that command has completed.
  */
 void
 AtAbort_Portals(void)
@@ -781,14 +776,22 @@ AtAbort_Portals(void)
 		Portal		portal = hentry->portal;
 
 		/*
-		 * When elog(FATAL) is progress, we need to set the active portal to
-		 * failed, so that PortalCleanup() doesn't run the executor shutdown.
+		 * If we're exiting due to a FATAL error, just blow away all portals
+		 * without really doing any cleanup.  Because the process is going
+		 * to exit anyway, it shouldn't really be necessary to do any cleanup.
+		 * It might also not be safe, if other parts of the system have
+		 * already been shut down. This also makes sure that no future
+		 * references to existing portals are possible.
 		 */
-		if (portal->status == PORTAL_ACTIVE && shmem_exit_inprogress)
-			MarkPortalFailed(portal);
+		if (shmem_exit_inprogress)
+		{
+			PortalHashTableDelete(portal);
+			continue;
+		}
 
 		/*
-		 * Do nothing else to cursors held over from a previous transaction.
+		 * Otherwise, do nothing to cursors held over from a previous
+		 * transaction.
 		 */
 		if (portal->createSubid == InvalidSubTransactionId)
 			continue;
@@ -802,98 +805,40 @@ AtAbort_Portals(void)
 			continue;
 
 		/*
-		 * If it was created in the current transaction, we can't do normal
-		 * shutdown on a READY portal either; it might refer to objects
-		 * created in the failed transaction.  See comments in
-		 * AtSubAbort_Portals.
-		 */
-		if (portal->status == PORTAL_READY)
-			MarkPortalFailed(portal);
-
-		/*
-		 * Allow portalcmds.c to clean up the state it knows about, if we
-		 * haven't already.
-		 */
-		if (PointerIsValid(portal->cleanup))
-		{
-			portal->cleanup(portal);
-			portal->cleanup = NULL;
-		}
-
-		/* drop cached plan reference, if any */
-		PortalReleaseCachedPlan(portal);
-
-		/*
-		 * Any resources belonging to the portal will be released in the
-		 * upcoming transaction-wide cleanup; they will be gone before we run
-		 * PortalDrop.
-		 */
-		portal->resowner = NULL;
-
-		/*
-		 * Although we can't delete the portal data structure proper, we can
-		 * release any memory in subsidiary contexts, such as executor state.
-		 * The cleanup hook was the last thing that might have needed data
-		 * there.  But leave active portals alone.
-		 */
-		if (portal->status != PORTAL_ACTIVE)
-			MemoryContextDeleteChildren(portal->portalContext);
-	}
-}
-
-/*
- * Post-abort cleanup for portals.
- *
- * Delete all portals not held over from prior transactions.  */
-void
-AtCleanup_Portals(void)
-{
-	HASH_SEQ_STATUS status;
-	PortalHashEnt *hentry;
-
-	hash_seq_init(&status, PortalHashTable);
-
-	while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
-	{
-		Portal		portal = hentry->portal;
-
-		/*
-		 * Do not touch active portals --- this can only happen in the case of
-		 * a multi-transaction command.
+		 * If the status is PORTAL_ACTIVE, then we must be executing a command
+		 * that uses multiple transactions internally. In that case, the
+		 * command in question must be one that does not internally rely on
+		 * any transaction-lifetime resources, because they would disappear
+		 * in the upcoming transaction-wide cleanup.
 		 */
 		if (portal->status == PORTAL_ACTIVE)
-			continue;
-
-		/*
-		 * Do nothing to cursors held over from a previous transaction or
-		 * auto-held ones.
-		 */
-		if (portal->createSubid == InvalidSubTransactionId || portal->autoHeld)
 		{
-			Assert(portal->status != PORTAL_ACTIVE);
-			Assert(portal->resowner == NULL);
+			/*
+			 * The resource owner is about to be destroyed, so we must not
+			 * retain a pointer to it.
+			 */
+			portal->resowner = NULL;
 			continue;
 		}
 
 		/*
+		 * Drop the portal.
+		 *
+		 * We used to postpone the actual removal of the portal until
+		 * CleanupTransaction() time, but there's not much point, because
+		 * until then the transaction is in a failed state and existing portals
+		 * can't be used anyway.
+		 *
+		 * (Maybe it would be a good idea to rearrange things so that a
+		 * transaction statement whose portal was created before the abort
+		 * could still be used afterwards to roll back the transaction, but
+		 * historically we haven't cared to support that case.)
+		 *
 		 * If a portal is still pinned, forcibly unpin it. PortalDrop will not
 		 * let us drop the portal otherwise. Whoever pinned the portal was
 		 * interrupted by the abort too and won't try to use it anymore.
 		 */
-		if (portal->portalPinned)
-			portal->portalPinned = false;
-
-		/*
-		 * We had better not call any user-defined code during cleanup, so if
-		 * the cleanup hook hasn't been run yet, too bad; we'll just skip it.
-		 */
-		if (PointerIsValid(portal->cleanup))
-		{
-			elog(WARNING, "skipping cleanup for portal \"%s\"", portal->name);
-			portal->cleanup = NULL;
-		}
-
-		/* Zap it. */
+		portal->portalPinned = false;
 		PortalDrop(portal, false);
 	}
 }
@@ -958,11 +903,9 @@ AtSubCommit_Portals(SubTransactionId mySubid,
 /*
  * Subtransaction abort handling for portals.
  *
- * Deactivate portals created or used during the failed subtransaction.
+ * Destroy portals created or used during the failed subtransaction.
  * Note that per AtSubCommit_Portals, this will catch portals created/used
  * in descendants of the subtransaction too.
- *
- * We don't destroy any portals here; that's done in AtSubCleanup_Portals.
  */
 void
 AtSubAbort_Portals(SubTransactionId mySubid,
@@ -979,6 +922,40 @@ AtSubAbort_Portals(SubTransactionId mySubid,
 	{
 		Portal		portal = hentry->portal;
 
+		/*
+		 * If we're exiting due to a FATAL error, just blow away all portals,
+		 * same as in AtAbort_Portals.
+		 */
+		if (shmem_exit_inprogress)
+		{
+			PortalHashTableDelete(portal);
+			continue;
+		}
+
+		/*
+		 * If we find an active portal, it means that the calling code is
+		 * trying to roll back a subtransaction in mid-command. Note that we
+		 * wouldn't reach this case on ERROR, because an ERROR while running
+		 * a portal puts it into a failed state, nor would we reach it from
+		 * executing a normal ROLLBACK TO SAVEPOINT command, which arranges
+		 * to do the real work after the portal is no longer active.  But
+		 * C code that manipulates transactions explicitly could cause us
+		 * to reach this.
+		 *
+		 * We would have big problems here if this portal depends on any
+		 * resources proper to the current subtransaction, but let's leave
+		 * that problem to the (hypothetical) caller.
+		 */
+		if (portal->status == PORTAL_ACTIVE)
+		{
+			/*
+			 * This would be nonsensical if the portal had been created in
+			 * the current subtransaction.
+			 */
+			Assert(portal->createSubid != mySubid);
+			continue;
+		}
+
 		/* Was it created in this subtransaction? */
 		if (portal->createSubid != mySubid)
 		{
@@ -988,25 +965,6 @@ AtSubAbort_Portals(SubTransactionId mySubid,
 				/* Maintain activeSubid until the portal is removed */
 				portal->activeSubid = parentSubid;
 
-				/*
-				 * A MarkPortalActive() caller ran an upper-level portal in
-				 * this subtransaction and left the portal ACTIVE.  This can't
-				 * happen, but force the portal into FAILED state for the same
-				 * reasons discussed below.
-				 *
-				 * We assume we can get away without forcing upper-level READY
-				 * portals to fail, even if they were run and then suspended.
-				 * In theory a suspended upper-level portal could have
-				 * acquired some references to objects that are about to be
-				 * destroyed, but there should be sufficient defenses against
-				 * such cases: the portal's original query cannot contain such
-				 * references, and any references within, say, cached plans of
-				 * PL/pgSQL functions are not from active queries and should
-				 * be protected by revalidation logic.
-				 */
-				if (portal->status == PORTAL_ACTIVE)
-					MarkPortalFailed(portal);
-
 				/*
 				 * Also, if we failed it during the current subtransaction
 				 * (either just above, or earlier), reattach its resource
@@ -1028,89 +986,10 @@ AtSubAbort_Portals(SubTransactionId mySubid,
 		}
 
 		/*
-		 * Force any live portals of my own subtransaction into FAILED state.
-		 * We have to do this because they might refer to objects created or
-		 * changed in the failed subtransaction, leading to crashes within
-		 * ExecutorEnd when portalcmds.c tries to close down the portal.
-		 * Currently, every MarkPortalActive() caller ensures it updates the
-		 * portal status again before relinquishing control, so ACTIVE can't
-		 * happen here.  If it does happen, dispose the portal like existing
-		 * MarkPortalActive() callers would.
+		 * Portals created in this subtransaction should be dropped, even
+		 * if pinned. See comments in AtAbort_Portals for more details.
 		 */
-		if (portal->status == PORTAL_READY ||
-			portal->status == PORTAL_ACTIVE)
-			MarkPortalFailed(portal);
-
-		/*
-		 * Allow portalcmds.c to clean up the state it knows about, if we
-		 * haven't already.
-		 */
-		if (PointerIsValid(portal->cleanup))
-		{
-			portal->cleanup(portal);
-			portal->cleanup = NULL;
-		}
-
-		/* drop cached plan reference, if any */
-		PortalReleaseCachedPlan(portal);
-
-		/*
-		 * Any resources belonging to the portal will be released in the
-		 * upcoming transaction-wide cleanup; they will be gone before we run
-		 * PortalDrop.
-		 */
-		portal->resowner = NULL;
-
-		/*
-		 * Although we can't delete the portal data structure proper, we can
-		 * release any memory in subsidiary contexts, such as executor state.
-		 * The cleanup hook was the last thing that might have needed data
-		 * there.
-		 */
-		MemoryContextDeleteChildren(portal->portalContext);
-	}
-}
-
-/*
- * Post-subabort cleanup for portals.
- *
- * Drop all portals created in the failed subtransaction (but note that
- * we will not drop any that were reassigned to the parent above).
- */
-void
-AtSubCleanup_Portals(SubTransactionId mySubid)
-{
-	HASH_SEQ_STATUS status;
-	PortalHashEnt *hentry;
-
-	hash_seq_init(&status, PortalHashTable);
-
-	while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
-	{
-		Portal		portal = hentry->portal;
-
-		if (portal->createSubid != mySubid)
-			continue;
-
-		/*
-		 * If a portal is still pinned, forcibly unpin it. PortalDrop will not
-		 * let us drop the portal otherwise. Whoever pinned the portal was
-		 * interrupted by the abort too and won't try to use it anymore.
-		 */
-		if (portal->portalPinned)
-			portal->portalPinned = false;
-
-		/*
-		 * We had better not call any user-defined code during cleanup, so if
-		 * the cleanup hook hasn't been run yet, too bad; we'll just skip it.
-		 */
-		if (PointerIsValid(portal->cleanup))
-		{
-			elog(WARNING, "skipping cleanup for portal \"%s\"", portal->name);
-			portal->cleanup = NULL;
-		}
-
-		/* Zap it. */
+		portal->portalPinned = false;
 		PortalDrop(portal, false);
 	}
 }
diff --git a/src/include/utils/portal.h b/src/include/utils/portal.h
index 098c837e0a..f8eeb96219 100644
--- a/src/include/utils/portal.h
+++ b/src/include/utils/portal.h
@@ -205,7 +205,6 @@ typedef struct PortalData
 extern void EnablePortalManager(void);
 extern bool PreCommit_Portals(bool isPrepare);
 extern void AtAbort_Portals(void);
-extern void AtCleanup_Portals(void);
 extern void PortalErrorCleanup(void);
 extern void AtSubCommit_Portals(SubTransactionId mySubid,
 								SubTransactionId parentSubid,
@@ -214,7 +213,6 @@ extern void AtSubAbort_Portals(SubTransactionId mySubid,
 							   SubTransactionId parentSubid,
 							   ResourceOwner myXactOwner,
 							   ResourceOwner parentXactOwner);
-extern void AtSubCleanup_Portals(SubTransactionId mySubid);
 extern Portal CreatePortal(const char *name, bool allowDup, bool dupSilent);
 extern Portal CreateNewPortal(void);
 extern void PinPortal(Portal portal);
-- 
2.17.2 (Apple Git-113)

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Dilip Kumar (#3)

Re: abort-time portal cleanup

On Tue, Sep 24, 2019 at 6:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Sep 13, 2019 at 2:13 AM Robert Haas <robertmhaas@gmail.com> wrote:

/*
* Otherwise, do nothing to cursors held over from a previous
* transaction.
*/
if (portal->createSubid == InvalidSubTransactionId)
continue;

/*
* Do nothing to auto-held cursors. This is similar to the case of a
* cursor from a previous transaction, but it could also be that the
* cursor was auto-held in this transaction, so it wants to live on.
*/
if (portal->autoHeld)
continue;

I have one doubt that why do we need the second check. Because before
setting portal->autoHeld to true we always call HoldPortal therein we
set portal->createSubid to InvalidSubTransactionId. So it seems to me
that the second condition will never reach. Am I missing something?

Not that I can see, but I don't necessarily think this patch needs to
change it, either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Andres Freund

andres@anarazel.de

over 6 years ago

In reply to: Robert Haas (#6)

Re: abort-time portal cleanup

Hi,

On 2019-10-07 12:14:52 -0400, Robert Haas wrote:

- if (portal->status == PORTAL_READY)
- MarkPortalFailed(portal);

Why it is safe to remove this check? It has been explained in commit
7981c342 why we need that check. I don't see any explanation in email
or patch which justifies this code removal. Is it because you removed
PortalCleanup? If so, that is still called from PortalDrop?

All MarkPortalFailed() does is change the status to PORTAL_FAILED and
call the cleanup hook. PortalDrop() calls the cleanup hook, and we
don't need to change the status if we're removing it completely.

Note that currently PortalCleanup() behaves differently depending on
whether the portal is set to failed or not...

- Andres

Robert Haas

robertmhaas@gmail.com

over 6 years ago

In reply to: Andres Freund (#8)

Re: abort-time portal cleanup

On Tue, Oct 8, 2019 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:

On 2019-10-07 12:14:52 -0400, Robert Haas wrote:

- if (portal->status == PORTAL_READY)
- MarkPortalFailed(portal);

Why it is safe to remove this check? It has been explained in commit
7981c342 why we need that check. I don't see any explanation in email
or patch which justifies this code removal. Is it because you removed
PortalCleanup? If so, that is still called from PortalDrop?

All MarkPortalFailed() does is change the status to PORTAL_FAILED and
call the cleanup hook. PortalDrop() calls the cleanup hook, and we
don't need to change the status if we're removing it completely.

Note that currently PortalCleanup() behaves differently depending on
whether the portal is set to failed or not...

Urk, yeah, I forgot about that. I think that's a wretched hack that
somebody ought to untangle at some point, but maybe for purposes of
this patch it makes more sense to just put the MarkPortalFailed call
back.

It's unclear to me why there's a special case here specifically for
PORTAL_READY. Like, why is PORTAL_NEW or PORTAL_DEFINED or
PORTAL_DONE any different? It seems like if we're aborting the
transaction, we should not be calling ExecutorFinish()/ExecutorEnd()
for anything. We could achieve that result by just nulling out the
cleanup hook unconditionally instead of having this complicated dance
where we mark ready portals failed, which calls the cleanup hook,
which decides not to do anything because the portal has been marked
failed. It'd be great if there were a few more comments in this file
explaining what the thinking behind all this was.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10

Amit Kapila

amit.kapila16@gmail.com

over 6 years ago

In reply to: Robert Haas (#9)

Re: abort-time portal cleanup

On Wed, Oct 9, 2019 at 6:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 8, 2019 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:

On 2019-10-07 12:14:52 -0400, Robert Haas wrote:

- if (portal->status == PORTAL_READY)
- MarkPortalFailed(portal);

Why it is safe to remove this check? It has been explained in commit
7981c342 why we need that check. I don't see any explanation in email
or patch which justifies this code removal. Is it because you removed
PortalCleanup? If so, that is still called from PortalDrop?

All MarkPortalFailed() does is change the status to PORTAL_FAILED and
call the cleanup hook. PortalDrop() calls the cleanup hook, and we
don't need to change the status if we're removing it completely.

Note that currently PortalCleanup() behaves differently depending on
whether the portal is set to failed or not...

Yeah, this is the reason, I mentioned it.

Urk, yeah, I forgot about that. I think that's a wretched hack that
somebody ought to untangle at some point, but maybe for purposes of
this patch it makes more sense to just put the MarkPortalFailed call
back.

+1.

It's unclear to me why there's a special case here specifically for
PORTAL_READY. Like, why is PORTAL_NEW or PORTAL_DEFINED or
PORTAL_DONE any different?

If read the commit message of commit 7981c34279 [1]commit 7981c34279fbddc254cfccb9a2eec4b35e692a12 Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Thu Feb 18 03:06:46 2010 +0000 which introduced
this, then we might get some clue. It is quite possible that we need
same handling for PORTAL_NEW, PORTAL_DEFINED, etc. but it seems we
just hit the problem mentioned in commit 7981c34279 for PORTAL_READY
state. I think as per commit, if we don't mark it failed, then with
auto_explain things can go wrong.

[1]: commit 7981c34279fbddc254cfccb9a2eec4b35e692a12 Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Thu Feb 18 03:06:46 2010 +0000
commit 7981c34279fbddc254cfccb9a2eec4b35e692a12
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu Feb 18 03:06:46 2010 +0000

Force READY portals into FAILED state when a transaction or
subtransaction is aborted, if they were created within the failed
xact. This prevents ExecutorEnd from being run on them, which is a
good idea because they may contain references to tables or other
objects that no longer exist. In particular this is hazardous when
auto_explain is active, but it's really rather surprising that nobody
has seen an issue with this before. I'm back-patching this to 8.4,
since that's the first version that contains auto_explain or an
ExecutorEnd hook, but I wonder whether we shouldn't back-patch
further.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com