[BUG] failed assertion in EnsurePortalSnapshotExists()

Started by Bertrand Drouvotover 4 years ago20 messageshackers
Jump to latest
#1Bertrand Drouvot
bertranddrouvot.pg@gmail.com

Hi hackers,

I recently observed a failed assertion in EnsurePortalSnapshotExists().

The steps to reproduce the issue on the master branch are:

create table bdt (a int primary key);
insert into bdt values (1),(2);
create table bdt2 (a int);
insert into bdt2 values (1);

Then launching:

DO $$
BEGIN
  FOR i IN 1..2 LOOP
    BEGIN
      INSERT INTO bdt (a) VALUES (i);
       exception when unique_violation then update bdt2 set a = i;
     COMMIT;
    END;
  END LOOP;
END;
$$;

Would produce:

server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost

Due to:

#2  0x0000000000b2ffcb in ExceptionalCondition (conditionName=0xd0d598
"portal->portalSnapshot == NULL", errorType=0xd0d0b3 "FailedAssertion",
fileName=0xd0d174 "pquery.c", lineNumber=1785) at assert.c:69
#3  0x000000000099e666 in EnsurePortalSnapshotExists () at pquery.c:1785

From what i have seen, we end up having ActiveSnapshot set to NULL in
AtSubAbort_Snapshot() (while we still have ActivePortal->portalSnapshot
not being NULL and not set to NULL later on).

That leads to ActiveSnapshotSet() not being true in the next call to
EnsurePortalSnapshotExists() and leads to the failed assertion (checking
that ActivePortal->portalSnapshot is NULL) later on in the code.

Based on this, i have created the attached patch (which fixes the issue
mentioned in the repro) but I have the feeling that I may have missed
something more important here (that would not be addressed with the
attached patch), thoughts?

Thanks

Bertrand

Attachments:

v1-0001-EnsurePortalSnapshotExists-failed-assertion.patchtext/plain; charset=UTF-8; name=v1-0001-EnsurePortalSnapshotExists-failed-assertion.patchDownload+5-0
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bertrand Drouvot (#1)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:

I recently observed a failed assertion in EnsurePortalSnapshotExists().

Hmm, interesting. If I take out the "update bdt2" step, so that the
exception clause is just COMMIT, then I get something different:

ERROR: portal snapshots (1) did not account for all active snapshots (0)
CONTEXT: PL/pgSQL function inline_code_block line 8 at COMMIT

I think perhaps plpgsql's exception handling needs a bit of adjustment,
but not sure what yet.

regards, tom lane

#3Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Tom Lane (#2)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

Hi,

On 9/27/21 9:44 PM, Tom Lane wrote:

"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:

I recently observed a failed assertion in EnsurePortalSnapshotExists().

Hmm, interesting.

Thanks for looking at it!

If I take out the "update bdt2" step, so that the
exception clause is just COMMIT, then I get something different:

ERROR: portal snapshots (1) did not account for all active snapshots (0)
CONTEXT: PL/pgSQL function inline_code_block line 8 at COMMIT

FWIW, I just gave it a try and it looks like this is also "fixed" by the
proposed patch.

Does it make sense (as it is currently) to set the ActiveSnapshot to
NULL and not ensuring the same is done for ActivePortal->portalSnapshot?

Or does it mean we should not reach a state where we set ActiveSnapshot
to NULL while ActivePortal->portalSnapshot is not already NULL?

Thanks

Bertrand

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bertrand Drouvot (#3)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:

Does it make sense (as it is currently) to set the ActiveSnapshot to
NULL and not ensuring the same is done for ActivePortal->portalSnapshot?

I think that patch is just a kluge :-(

After tracing through this I've figured out what is happening, and
why you need to put the exception block inside a loop to make it
happen. The first iteration goes fine, and we end up with an empty
ActiveSnapshot stack (and no portalSnapshots) because that's how
the COMMIT leaves it. In the second iteration, we try to
re-establish the portal snapshot, but at the point where
EnsurePortalSnapshotExists is called (from the first INSERT
command) we are already inside a subtransaction thanks to the
plpgsql exception block. This means that the stacked ActiveSnapshot
has as_level = 2, although the Portal owning it belongs to the
outer transaction. So at the next exception, AtSubAbort_Snapshot
zaps the stacked ActiveSnapshot, but the Portal stays alive and
now it has a dangling portalSnapshot pointer.

Basically there seem to be two ways to fix this, both requiring
API additions to snapmgr.c (hence, cc'ing Alvaro for opinion):

1. Provide a variant of PushActiveSnapshot that allows the caller
to specify the as_level to use, and then have
EnsurePortalSnapshotExists, as well as other places that create
Portal-associated snapshots, use this with as_level equal to the
Portal's createSubid. The main hazard I see here is that this could
theoretically allow the ActiveSnapshot stack to end up with
out-of-order as_level values, causing issues. For the moment we
could probably just add assertions to check that the passed as_level
is >= next level, or maybe even that this variant is only called with
empty ActiveSnapshot stack.

2. Provide a way for AtSubAbort_Portals to detect whether a
portalSnapshot pointer points to a snap that's going to be
deleted by AtSubAbort_Snapshot, and then just have it clear any
portalSnapshots that are about to become dangling. (This'd amount
to accepting the possibility that portalSnapshot is of a different
subxact level from the portal, and dealing with the situation.)

I initially thought #2 would be the way to go, but it turns out
to be a bit messy since what we have is a Snapshot pointer not an
ActiveSnapshotElt pointer. We'd have to do something like search the
ActiveSnapshot stack looking for pointer equality to the caller's
Snapshot pointer, which seems fragile --- do we assume as_snap is
unique for any other purpose?

That being the case, I'm now leaning to #1. Thoughts?

regards, tom lane

#5Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Tom Lane (#4)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

Hi,

On 9/28/21 6:50 PM, Tom Lane wrote:

"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:

Does it make sense (as it is currently) to set the ActiveSnapshot to
NULL and not ensuring the same is done for ActivePortal->portalSnapshot?

I think that patch is just a kluge :-(

right, sounded too simple to be "true".

After tracing through this I've figured out what is happening, and
why you need to put the exception block inside a loop to make it
happen. The first iteration goes fine, and we end up with an empty
ActiveSnapshot stack (and no portalSnapshots) because that's how
the COMMIT leaves it. In the second iteration, we try to
re-establish the portal snapshot, but at the point where
EnsurePortalSnapshotExists is called (from the first INSERT
command) we are already inside a subtransaction thanks to the
plpgsql exception block. This means that the stacked ActiveSnapshot
has as_level = 2, although the Portal owning it belongs to the
outer transaction. So at the next exception, AtSubAbort_Snapshot
zaps the stacked ActiveSnapshot, but the Portal stays alive and
now it has a dangling portalSnapshot pointer.

Thanks for the explanation!

Basically there seem to be two ways to fix this, both requiring
API additions to snapmgr.c (hence, cc'ing Alvaro for opinion):

1. Provide a variant of PushActiveSnapshot that allows the caller
to specify the as_level to use, and then have
EnsurePortalSnapshotExists, as well as other places that create
Portal-associated snapshots, use this with as_level equal to the
Portal's createSubid. The main hazard I see here is that this could
theoretically allow the ActiveSnapshot stack to end up with
out-of-order as_level values, causing issues. For the moment we
could probably just add assertions to check that the passed as_level
is >= next level, or maybe even that this variant is only called with
empty ActiveSnapshot stack.

I think we may get a non empty ActiveSnapshot stack with prepared
statements, so tempted to do the assertion on the levels.

2. Provide a way for AtSubAbort_Portals to detect whether a
portalSnapshot pointer points to a snap that's going to be
deleted by AtSubAbort_Snapshot, and then just have it clear any
portalSnapshots that are about to become dangling. (This'd amount
to accepting the possibility that portalSnapshot is of a different
subxact level from the portal, and dealing with the situation.)

I initially thought #2 would be the way to go, but it turns out
to be a bit messy since what we have is a Snapshot pointer not an
ActiveSnapshotElt pointer. We'd have to do something like search the
ActiveSnapshot stack looking for pointer equality to the caller's
Snapshot pointer, which seems fragile --- do we assume as_snap is
unique for any other purpose?

That being the case, I'm now leaning to #1. Thoughts?

I'm also inclined to #1.

Please find attached a patch proposal for #1 that also adds a new test.

Thanks

Bertrand

Attachments:

v2-0001-EnsurePortalSnapshotExists-failed-assertion.patchtext/plain; charset=UTF-8; name=v2-0001-EnsurePortalSnapshotExists-failed-assertion.patchDownload+88-9
#6Ranier Vilela
ranier.vf@gmail.com
In reply to: Bertrand Drouvot (#5)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

Em qua., 29 de set. de 2021 às 06:55, Drouvot, Bertrand <bdrouvot@amazon.com>
escreveu:

Hi,

On 9/28/21 6:50 PM, Tom Lane wrote:

"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:

Does it make sense (as it is currently) to set the ActiveSnapshot to
NULL and not ensuring the same is done for ActivePortal->portalSnapshot?

I think that patch is just a kluge :-(

right, sounded too simple to be "true".

After tracing through this I've figured out what is happening, and
why you need to put the exception block inside a loop to make it
happen. The first iteration goes fine, and we end up with an empty
ActiveSnapshot stack (and no portalSnapshots) because that's how
the COMMIT leaves it. In the second iteration, we try to
re-establish the portal snapshot, but at the point where
EnsurePortalSnapshotExists is called (from the first INSERT
command) we are already inside a subtransaction thanks to the
plpgsql exception block. This means that the stacked ActiveSnapshot
has as_level = 2, although the Portal owning it belongs to the
outer transaction. So at the next exception, AtSubAbort_Snapshot
zaps the stacked ActiveSnapshot, but the Portal stays alive and
now it has a dangling portalSnapshot pointer.

Thanks for the explanation!

Basically there seem to be two ways to fix this, both requiring
API additions to snapmgr.c (hence, cc'ing Alvaro for opinion):

1. Provide a variant of PushActiveSnapshot that allows the caller
to specify the as_level to use, and then have
EnsurePortalSnapshotExists, as well as other places that create
Portal-associated snapshots, use this with as_level equal to the
Portal's createSubid. The main hazard I see here is that this could
theoretically allow the ActiveSnapshot stack to end up with
out-of-order as_level values, causing issues. For the moment we
could probably just add assertions to check that the passed as_level
is >= next level, or maybe even that this variant is only called with
empty ActiveSnapshot stack.

I think we may get a non empty ActiveSnapshot stack with prepared
statements, so tempted to do the assertion on the levels.

2. Provide a way for AtSubAbort_Portals to detect whether a
portalSnapshot pointer points to a snap that's going to be
deleted by AtSubAbort_Snapshot, and then just have it clear any
portalSnapshots that are about to become dangling. (This'd amount
to accepting the possibility that portalSnapshot is of a different
subxact level from the portal, and dealing with the situation.)

I initially thought #2 would be the way to go, but it turns out
to be a bit messy since what we have is a Snapshot pointer not an
ActiveSnapshotElt pointer. We'd have to do something like search the
ActiveSnapshot stack looking for pointer equality to the caller's
Snapshot pointer, which seems fragile --- do we assume as_snap is
unique for any other purpose?

That being the case, I'm now leaning to #1. Thoughts?

I'm also inclined to #1.

I have a stupid question, why duplicate PushActiveSnapshot?
Wouldn't one function be better?

PushActiveSnapshot(Snapshot snap, int as_level);

Sample calls:
PushActiveSnapshot(GetTransactionSnapshot(),
GetCurrentTransactionNestLevel());
PushActiveSnapshot(queryDesc->snapshot, GetCurrentTransactionNestLevel());
PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);

regards,
Ranier Vilela

#7Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Ranier Vilela (#6)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

Hi,

On 9/29/21 12:59 PM, Ranier Vilela wrote:

Em qua., 29 de set. de 2021 às 06:55, Drouvot, Bertrand
<bdrouvot@amazon.com> escreveu:

I'm also inclined to #1.

I have a stupid question, why duplicate PushActiveSnapshot?
Wouldn't one function be better?

PushActiveSnapshot(Snapshot snap, int as_level);

Sample calls:
PushActiveSnapshot(GetTransactionSnapshot(),
GetCurrentTransactionNestLevel());
PushActiveSnapshot(queryDesc->snapshot,
GetCurrentTransactionNestLevel());
PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);

I would say because that could "break" existing extensions for example.

Adding a new function prevents "updating" existing extensions making use
of PushActiveSnapshot().

Thanks

Bertrand

#8Ranier Vilela
ranier.vf@gmail.com
In reply to: Bertrand Drouvot (#7)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com>
escreveu:

Hi,
On 9/29/21 12:59 PM, Ranier Vilela wrote:

Em qua., 29 de set. de 2021 às 06:55, Drouvot, Bertrand <
bdrouvot@amazon.com> escreveu:

I'm also inclined to #1.

I have a stupid question, why duplicate PushActiveSnapshot?
Wouldn't one function be better?

PushActiveSnapshot(Snapshot snap, int as_level);

Sample calls:
PushActiveSnapshot(GetTransactionSnapshot(),
GetCurrentTransactionNestLevel());
PushActiveSnapshot(queryDesc->snapshot, GetCurrentTransactionNestLevel());
PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);

I would say because that could "break" existing extensions for example.

Adding a new function prevents "updating" existing extensions making use
of PushActiveSnapshot().

Valid argument of course.
But the extensions should also fit the core code.
Duplicating functions is very bad for maintenance and bloats the code
unnecessarily, IMHO.

regards,
Ranier Vilela

#9Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Ranier Vilela (#8)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

On 9/29/21 1:23 PM, Ranier Vilela wrote:

Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand
<bdrouvot@amazon.com> escreveu:

Hi,

On 9/29/21 12:59 PM, Ranier Vilela wrote:

Em qua., 29 de set. de 2021 às 06:55, Drouvot, Bertrand
<bdrouvot@amazon.com> escreveu:

I'm also inclined to #1.

I have a stupid question, why duplicate PushActiveSnapshot?
Wouldn't one function be better?

PushActiveSnapshot(Snapshot snap, int as_level);

Sample calls:
PushActiveSnapshot(GetTransactionSnapshot(),
GetCurrentTransactionNestLevel());
PushActiveSnapshot(queryDesc->snapshot,
GetCurrentTransactionNestLevel());
PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);

I would say because that could "break" existing extensions for
example.

Adding a new function prevents "updating" existing extensions
making use of PushActiveSnapshot().

Valid argument of course.
But the extensions should also fit the core code.
Duplicating functions is very bad for maintenance and bloats the code
unnecessarily, IMHO.

Right. I don't have a strong opinion about this.

Let's see what Tom, Alvaro or others arguments/opinions are (should they
also want to go with option #1).

Thanks

Bertrand

#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#4)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

On 2021-Sep-28, Tom Lane wrote:

1. Provide a variant of PushActiveSnapshot that allows the caller
to specify the as_level to use, and then have
EnsurePortalSnapshotExists, as well as other places that create
Portal-associated snapshots, use this with as_level equal to the
Portal's createSubid. The main hazard I see here is that this could
theoretically allow the ActiveSnapshot stack to end up with
out-of-order as_level values, causing issues. For the moment we
could probably just add assertions to check that the passed as_level
is >= next level, or maybe even that this variant is only called with
empty ActiveSnapshot stack.

I don't see anything wrong with this idea offhand.

I didn't try to create scenarios with out-of-order as_level active
snapshots, but with all the creativity out there in the world, and
considering that it's possible to write procedures in C, I think that
asserting that the order is maintained is warranted.

Now if we do meet a case with out-of-order levels, what do we do? I
suppose we'd need to update AtSubCommit_Snapshot and AtSubAbort_Snapshot
to cope with that (but by all means let's wait until we have a test case
where that happens ...)

2. Provide a way for AtSubAbort_Portals to detect whether a
portalSnapshot pointer points to a snap that's going to be
deleted by AtSubAbort_Snapshot, and then just have it clear any
portalSnapshots that are about to become dangling. (This'd amount
to accepting the possibility that portalSnapshot is of a different
subxact level from the portal, and dealing with the situation.)

I initially thought #2 would be the way to go, but it turns out
to be a bit messy since what we have is a Snapshot pointer not an
ActiveSnapshotElt pointer. We'd have to do something like search the
ActiveSnapshot stack looking for pointer equality to the caller's
Snapshot pointer, which seems fragile --- do we assume as_snap is
unique for any other purpose?

I don't remember what patch it was, but I remember contemplating
sometime during the past year the possibility of snapshots being used
twice in the active stack. Now maybe this is not possible in practice
because in most cases we create a copy, but I couldn't swear that that's
always the case. I wouldn't rely on that.

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
"Las cosas son buenas o malas segun las hace nuestra opinión" (Lisias)

#11Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Ranier Vilela (#8)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

On 2021-Sep-29, Ranier Vilela wrote:

Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com>
escreveu:

Adding a new function prevents "updating" existing extensions making use
of PushActiveSnapshot().

Valid argument of course.
But the extensions should also fit the core code.
Duplicating functions is very bad for maintenance and bloats the code
unnecessarily, IMHO.

Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are
updated in the patch. Given that six sevenths of the calls continue to
use the existing function and that it is less verbose than the new one,
that seems sufficient argument to keep it.

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep." (Robert Davidson)
http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#11)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On 2021-Sep-29, Ranier Vilela wrote:

Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com>
escreveu:
Duplicating functions is very bad for maintenance and bloats the code
unnecessarily, IMHO.

Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are
updated in the patch. Given that six sevenths of the calls continue to
use the existing function and that it is less verbose than the new one,
that seems sufficient argument to keep it.

Seeing that we have to back-patch this, changing the ABI of
PushActiveSnapshot seems like a complete non-starter.

The idea I'd had to avoid code duplication was to make
PushActiveSnapshot a wrapper for the extended function:

void
PushActiveSnapshot(Snapshot snap)
{
PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel());
}

This would add one function call to the common code path, but there
are enough function calls in PushActiveSnapshot that I don't think
that's a big concern.

Another point is that this'd also add the as_level ordering assertion
to the common code path, but on the whole I think that's good not bad.

BTW, this is not great code:

+	if (ActiveSnapshot != NULL && ActiveSnapshot->as_next != NULL)
+		Assert(as_level >= ActiveSnapshot->as_next->as_level);

You want it all wrapped in the Assert, so that there's not any code
left in a non-assert build (which the compiler may or may not optimize
away, perhaps after complaining about a side-effect-free statement).

Actually, it's plain wrong, because you should be looking at the
top as_level not the next one. So more like

Assert(ActiveSnapshot == NULL ||
snap_level >= ActiveSnapshot->as_level);

regards, tom lane

#13Ranier Vilela
ranier.vf@gmail.com
In reply to: Tom Lane (#12)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

Em qua., 29 de set. de 2021 às 15:01, Tom Lane <tgl@sss.pgh.pa.us> escreveu:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On 2021-Sep-29, Ranier Vilela wrote:

Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <

bdrouvot@amazon.com>

escreveu:
Duplicating functions is very bad for maintenance and bloats the code
unnecessarily, IMHO.

Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are
updated in the patch. Given that six sevenths of the calls continue to
use the existing function and that it is less verbose than the new one,
that seems sufficient argument to keep it.

Seeing that we have to back-patch this, changing the ABI of
PushActiveSnapshot seems like a complete non-starter.

The idea I'd had to avoid code duplication was to make
PushActiveSnapshot a wrapper for the extended function:

void
PushActiveSnapshot(Snapshot snap)
{
PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel());
}

Much better.

regards,
Ranier Vilela

#14Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Tom Lane (#12)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

On 9/29/21 8:01 PM, Tom Lane wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On 2021-Sep-29, Ranier Vilela wrote:

Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com>
escreveu:
Duplicating functions is very bad for maintenance and bloats the code
unnecessarily, IMHO.

Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are
updated in the patch. Given that six sevenths of the calls continue to
use the existing function and that it is less verbose than the new one,
that seems sufficient argument to keep it.

Seeing that we have to back-patch this, changing the ABI of
PushActiveSnapshot seems like a complete non-starter.

The idea I'd had to avoid code duplication was to make
PushActiveSnapshot a wrapper for the extended function:

void
PushActiveSnapshot(Snapshot snap)
{
PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel());
}

Implemented into the new attached patch.

So more like

Assert(ActiveSnapshot == NULL ||
snap_level >= ActiveSnapshot->as_level);

Implemented into the new attached patch.

But make check is now failing on join_hash.sql, I have been able to
repro with:

create table bdt (a int);
begin;
savepoint a;
rollback to a;
explain select count(*) from bdt;

Which triggers a failed assertion on the new one:

TRAP: FailedAssertion("ActiveSnapshot == NULL || as_level >=
ActiveSnapshot->as_level"

because we have as_level = 2 while ActiveSnapshot->as_level = 3.

Thanks

Bertrand

Attachments:

v2-0002-EnsurePortalSnapshotExists-failed-assertion.patchtext/plain; charset=UTF-8; name=v2-0002-EnsurePortalSnapshotExists-failed-assertion.patchDownload+63-11
#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bertrand Drouvot (#14)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:

But make check is now failing on join_hash.sql, I have been able to
repro with:

Oh, duh, should have thought a bit harder. createSubid is a sequential
subtransaction number; it's not the same as the as_level nesting level.

Probably the most effective way to handle this is to add a subtransaction
nesting-level field to struct Portal, so we can pass that. I don't recall
that xact.c provides any easy way to extract the nesting level of a
subtransaction that's not the most closely nested one.

regards, tom lane

#16Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Tom Lane (#15)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

On 9/29/21 10:11 PM, Tom Lane wrote:

"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:

But make check is now failing on join_hash.sql, I have been able to
repro with:

Oh, duh, should have thought a bit harder. createSubid is a sequential
subtransaction number; it's not the same as the as_level nesting level.

Oh right, thanks for the explanation.

Probably the most effective way to handle this is to add a subtransaction
nesting-level field to struct Portal, so we can pass that.

Agree, done that way in the new attached patch.

Thanks

Bertrand

Attachments:

v2-0003-EnsurePortalSnapshotExists-failed-assertion.patchtext/plain; charset=UTF-8; name=v2-0003-EnsurePortalSnapshotExists-failed-assertion.patchDownload+72-14
#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bertrand Drouvot (#16)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:

[ v2-0003-EnsurePortalSnapshotExists-failed-assertion.patch ]

Looking through this, I think you were overenthusiastic about applying
PushActiveSnapshotWithLevel. We don't really need to use it except in
the places where we're setting portalSnapshot, because other than those
cases we don't have a risk of portalSnapshot becoming a dangling pointer.
Also, I'm quite nervous about applying PushActiveSnapshotWithLevel to
snapshots that aren't created by the portal machinery itself, because
we don't know very much about where passed-in snapshots came from or
what the caller thinks their lifespan is.

The attached revision therefore backs off to only using the new code
in the two places where we really need it. I made a number of
more-cosmetic revisions too. Notably, I think it's useful to frame
the testing shortcoming as "we were not testing COMMIT/ROLLBACK
inside a plpgsql exception block". So I moved the test code to the
plpgsql tests and made it check ROLLBACK too.

regards, tom lane

PS: Memo to self: in the back branches, the new field has to be
added at the end of struct Portal.

Attachments:

v3-EnsurePortalSnapshotExists-failed-assertion.patchtext/x-diff; charset=us-ascii; name=v3-EnsurePortalSnapshotExists-failed-assertion.patchDownload+93-8
#18Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Tom Lane (#17)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

On 9/30/21 7:16 PM, Tom Lane wrote:

"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:

[ v2-0003-EnsurePortalSnapshotExists-failed-assertion.patch ]

Looking through this, I think you were overenthusiastic about applying
PushActiveSnapshotWithLevel. We don't really need to use it except in
the places where we're setting portalSnapshot, because other than those
cases we don't have a risk of portalSnapshot becoming a dangling pointer.
Also, I'm quite nervous about applying PushActiveSnapshotWithLevel to
snapshots that aren't created by the portal machinery itself, because
we don't know very much about where passed-in snapshots came from or
what the caller thinks their lifespan is.

Oh right, I did not think about it, thanks!

The attached revision therefore backs off to only using the new code
in the two places where we really need it.

Ok, so in PortalRunUtility() and EnsurePortalSnapshotExists().

I made a number of
more-cosmetic revisions too.

thanks!

Notably, I think it's useful to frame
the testing shortcoming as "we were not testing COMMIT/ROLLBACK
inside a plpgsql exception block". So I moved the test code to the
plpgsql tests and made it check ROLLBACK too.

Indeed, makes sense.

regards, tom lane

PS: Memo to self: in the back branches, the new field has to be
added at the end of struct Portal.

out of curiosity, why?

Thanks

Bertrand

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bertrand Drouvot (#18)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:

On 9/30/21 7:16 PM, Tom Lane wrote:

PS: Memo to self: in the back branches, the new field has to be
added at the end of struct Portal.

out of curiosity, why?

Sticking it into the middle would create an ABI break for any
extension code that's looking at struct Portal, due to movement
of existing field offsets. In HEAD that's fine, so we should
put the field where it makes the most sense. But we have to
be careful about changing globally-visible structs in released
branches.

regards, tom lane

#20Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Tom Lane (#19)
Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

On 9/30/21 8:25 PM, Tom Lane wrote:

"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:

On 9/30/21 7:16 PM, Tom Lane wrote:

PS: Memo to self: in the back branches, the new field has to be
added at the end of struct Portal.

out of curiosity, why?

Sticking it into the middle would create an ABI break for any
extension code that's looking at struct Portal, due to movement
of existing field offsets. In HEAD that's fine, so we should
put the field where it makes the most sense. But we have to
be careful about changing globally-visible structs in released
branches.

Got it, thanks!

Bertrand