TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
Hi,
On master I got a FailedAssertion("HaveRegisteredOrActiveSnapshot()"
on an assert-enabled instance and with (I think) data over a certain length.
I whittled it down to the attached bash (careful - it drops stuff). It
has 5 tsv-data lines (one long line) that COPY slurps into a table. The
middle, third line causes the problem, later on. Shortening the long
line to somewhere below 2000 characters fixes it again.
More info in the attached .sh file.
If debug-assert is 'off', the problem does not occur. (REL_14_STABLE
also does not have the problem, assertions or not)
thanks,
Erik Rijkers
Attachments:
At Sun, 27 Mar 2022 20:32:45 +0200, Erik Rijkers <er@xs4all.nl> wrote in
On master I got a FailedAssertion("HaveRegisteredOrActiveSnapshot()"
on an assert-enabled instance and with (I think) data over a certain
length.I whittled it down to the attached bash (careful - it drops stuff).
It has 5 tsv-data lines (one long line) that COPY slurps into a table.
The middle, third line causes the problem, later on. Shortening the
long line to somewhere below 2000 characters fixes it again.More info in the attached .sh file.
It is reproducible for me. Thanks for the reproducer.
If debug-assert is 'off', the problem does not occur. (REL_14_STABLE
also does not have the problem, assertions or not)
It seems like related with [1]/messages/by-id/20210623035916.GL29179@telsasoft.com?
Inserting EnsurePortalSnapshotExists() to RunFromStore fixes this but
I'm not sure where is the right place to do this..
[1]: /messages/by-id/20210623035916.GL29179@telsasoft.com
For someone's information, this is more readable stack trace.
#0 0x00007f43aeed037f in raise () from /lib64/libc.so.6
#1 0x00007f43aeebadb5 in abort () from /lib64/libc.so.6
#2 0x0000000000b28747 in ExceptionalCondition (
conditionName=0xba2c48 "HaveRegisteredOrActiveSnapshot()",
errorType=0xba2882 "FailedAssertion",
fileName=0xba2870 "toast_internals.c", lineNumber=670) at assert.c:69
#3 0x00000000004ac776 in init_toast_snapshot (toast_snapshot=0x7ffce64f7440)
at toast_internals.c:670
#4 0x00000000005164ea in heap_fetch_toast_slice (toastrel=0x7f43b193cad0,
valueid=16393, attrsize=1848, sliceoffset=0, slicelength=1848,
result=0x1cbb948) at heaptoast.c:688
#5 0x000000000049fc86 in table_relation_fetch_toast_slice (
toastrel=0x7f43b193cad0, valueid=16393, attrsize=1848, sliceoffset=0,
slicelength=1848, result=0x1cbb948)
at ../../../../src/include/access/tableam.h:1892
#6 0x00000000004a0a0f in toast_fetch_datum (attr=0x1d6b171) at detoast.c:375
#7 0x000000000049fffb in detoast_attr (attr=0x1d6b171) at detoast.c:123
#8 0x0000000000b345ba in pg_detoast_datum_packed (datum=0x1d6b171)
at fmgr.c:1757
#9 0x0000000000aece72 in text_to_cstring (t=0x1d6b171) at varlena.c:225
#10 0x0000000000aedda2 in textout (fcinfo=0x7ffce64f77a0) at varlena.c:574
#11 0x0000000000b331bf in FunctionCall1Coll (flinfo=0x1d695e0, collation=0,
arg1=30847345) at fmgr.c:1138
#12 0x0000000000b3422b in OutputFunctionCall (flinfo=0x1d695e0, val=30847345)
at fmgr.c:1575
#13 0x00000000004a6b6c in printtup (slot=0x1cb81f0, self=0x1c96e90)
at printtup.c:357
#14 0x000000000099499f in RunFromStore (portal=0x1cf9380,
direction=ForwardScanDirection, count=0, dest=0x1c96e90) at pquery.c:1096
#15 0x00000000009944e3 in PortalRunSelect (portal=0x1cf9380, forward=true,
count=0, dest=0x1c96e90) at pquery.c:917
#16 0x00000000009941d3 in PortalRun (portal=0x1cf9380,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x1c96e90, altdest=0x1c96e90, qc=0x7ffce64f7ac0) at pquery.c:765
#17 0x000000000098df4b in exec_simple_query (
query_string=0x1c96030 "fetch all in myportal;") at postgres.c:1250
#18 0x00000000009923a3 in PostgresMain (dbname=0x1cc11b0 "postgres",
username=0x1cc1188 "horiguti") at postgres.c:4520
#19 0x00000000008c6caf in BackendRun (port=0x1cb74c0) at postmaster.c:4593
#20 0x00000000008c6631 in BackendStartup (port=0x1cb74c0) at postmaster.c:4321
#21 0x00000000008c29cb in ServerLoop () at postmaster.c:1801
#22 0x00000000008c2298 in PostmasterMain (argc=1, argv=0x1c8e0e0)
at postmaster.c:1473
#23 0x00000000007c14c3 in main (argc=1, argv=0x1c8e0e0) at main.c:202
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 28 Mar 2022 18:36:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Inserting EnsurePortalSnapshotExists() to RunFromStore fixes this but
I'm not sure where is the right place to do this..
Then, I found that portal->holdSnapshot is that. I came up with the
attached. It does the follows:
1. Teach PlannedStmtRequiresSnapshot() to return true for FetchStmt.
2. Use holdSnapshot in RunFromStore if any.
The rerpducer is reduced to as small as the following.
CREATE TABLE t (a text);
INSERT INTO t VALUES('some random text');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;
But I haven't come up with a reasonable way to generate the 'some
random text' yet.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
use_holdsnapshot_to_read_cursor.patchtext/x-patch; charset=us-asciiDownload+19-1
At Tue, 29 Mar 2022 17:06:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Mon, 28 Mar 2022 18:36:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Then, I found that portal->holdSnapshot is that. I came up with the
attached. It does the follows:1. Teach PlannedStmtRequiresSnapshot() to return true for FetchStmt.
2. Use holdSnapshot in RunFromStore if any.
The rerpducer is reduced to as small as the following.
CREATE TABLE t (a text);
INSERT INTO t VALUES('some random text');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;But I haven't come up with a reasonable way to generate the 'some
random text' yet.
I gave up and took a straightforward way to generate one.
I don't like that it uses a fixed length for the random text, but
anyway it works for now...
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v1-0001-Appropriately-set-snapshot-on-cursor-fetch.patchtext/x-patch; charset=us-asciiDownload+58-2
On Tue, 29 Mar 2022 at 11:10, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Tue, 29 Mar 2022 17:06:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Mon, 28 Mar 2022 18:36:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Then, I found that portal->holdSnapshot is that. I came up with the
attached. It does the follows:1. Teach PlannedStmtRequiresSnapshot() to return true for FetchStmt.
2. Use holdSnapshot in RunFromStore if any.
The rerpducer is reduced to as small as the following.
CREATE TABLE t (a text);
INSERT INTO t VALUES('some random text');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;But I haven't come up with a reasonable way to generate the 'some
random text' yet.I gave up and took a straightforward way to generate one.
I don't like that it uses a fixed length for the random text, but
anyway it works for now...
An shorter (?) reproducer might be the following, which forces any
value for 'a' to be toasted and thus triggering the check in
init_toast_snapshot regardless of value length:
CREATE TABLE t (a text);
ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL;
INSERT INTO t VALUES ('toast');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;
Enjoy,
-Matthias
Op 29-03-2022 om 12:50 schreef Matthias van de Meent:
On Tue, 29 Mar 2022 at 11:10, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Tue, 29 Mar 2022 17:06:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Mon, 28 Mar 2022 18:36:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Then, I found that portal->holdSnapshot is that. I came up with the
attached. It does the follows:1. Teach PlannedStmtRequiresSnapshot() to return true for FetchStmt.
2. Use holdSnapshot in RunFromStore if any.
The rerpducer is reduced to as small as the following.
CREATE TABLE t (a text);
INSERT INTO t VALUES('some random text');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;But I haven't come up with a reasonable way to generate the 'some
random text' yet.I gave up and took a straightforward way to generate one.
I don't like that it uses a fixed length for the random text, but
anyway it works for now...An shorter (?) reproducer might be the following, which forces any
value for 'a' to be toasted and thus triggering the check in
init_toast_snapshot regardless of value length:CREATE TABLE t (a text);
ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL;
INSERT INTO t VALUES ('toast');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;
Excellent. That indeed immediately forces the error.
(and the patch prevents it)
Thanks!
Show quoted text
Enjoy,
-Matthias
At Tue, 29 Mar 2022 13:29:15 +0200, Erik Rijkers <er@xs4all.nl> wrote in
Op 29-03-2022 om 12:50 schreef Matthias van de Meent:
An shorter (?) reproducer might be the following, which forces any
value for 'a' to be toasted and thus triggering the check in
init_toast_snapshot regardless of value length:
CREATE TABLE t (a text);
ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL;
INSERT INTO t VALUES ('toast');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;
Yeah, unfortunately I tried that first and saw it didn't work. And it
still doesn't for me. With such a short text pg_detoast_datum_pakced
doesn't call detoast_attr. Actually it is VARATT_IS_1B. (@master)
I think I'm missing something here. I'm going to examine around.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 30 Mar 2022 10:06:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Tue, 29 Mar 2022 13:29:15 +0200, Erik Rijkers <er@xs4all.nl> wrote in
Op 29-03-2022 om 12:50 schreef Matthias van de Meent:
An shorter (?) reproducer might be the following, which forces any
value for 'a' to be toasted and thus triggering the check in
init_toast_snapshot regardless of value length:
CREATE TABLE t (a text);
ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL;
INSERT INTO t VALUES ('toast');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;Yeah, unfortunately I tried that first and saw it didn't work. And it
still doesn't for me. With such a short text pg_detoast_datum_pakced
doesn't call detoast_attr. Actually it is VARATT_IS_1B. (@master)I think I'm missing something here. I'm going to examine around.
Hmm. Strange. My memory tells that I did the same thing before.. I
thought that it is somewhat related to compression since repeat('x',
4096) didin't seem working at that time, but it worked this time.
Maybe I was confused between extended and external..
But, in the first place the *fix* has been found to be wrong. I'm
going to search for the right fix..
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 30 Mar 2022 11:46:13 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
But, in the first place the *fix* has been found to be wrong. I'm
going to search for the right fix..
FETCH uses the snapshot at DECLARE. So anyhow I needed to set the
queryDesk's snapshot used in PortalRunSelect to the FETCH's portal's
holdSnapshot. What I did in this version is:
1. Add a new member "snapshot" to the type DestReceiver.
2. In PortalRunSelect, set the DECLARE'd query's snapshot to the
member iff the dest is tupelstore and the active snapshot is not
set.
3. In FillPortalStore, copy the snapshot to the portal's holdSnapshot.
4. RunFromStore uses holdSnapshot if any.
I'm not still confident on this, but it should be better than the v1.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v2-0001-Appropriately-set-snapshot-on-cursor-fetch.patchtext/x-patch; charset=us-asciiDownload+90-7
At Wed, 30 Mar 2022 17:58:24 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Wed, 30 Mar 2022 11:46:13 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
But, in the first place the *fix* has been found to be wrong. I'm
going to search for the right fix..FETCH uses the snapshot at DECLARE. So anyhow I needed to set the
queryDesk's snapshot used in PortalRunSelect to the FETCH's portal's
holdSnapshot. What I did in this version is:
By the way, this is, given that the added check in init_toast_snapshot
is valid, a long-standing "bug", which at least back to 12.
I'm not sure what to do for this.
1. ignore the check for certain cases?
2. apply any fix only to master and call it a day. 14 and earlier
doesn't have the assertion check so they don't complain.
3. apply a fix to all affected versions.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Mar 30, 2022 at 05:58:24PM +0900, Kyotaro Horiguchi wrote:
I'm not still confident on this, but it should be better than the v1.
+Andres as this seems to be related to 277692220.
I added it as an Opened Item.
Hi,
On 2022-04-13 15:28:19 -0500, Justin Pryzby wrote:
On Wed, Mar 30, 2022 at 05:58:24PM +0900, Kyotaro Horiguchi wrote:
I'm not still confident on this, but it should be better than the v1.
+Andres as this seems to be related to 277692220.
FWIW, that'd just mean it's an old bug that wasn't easily noticeable
before, not that it's the fault of 277692220.
I added it as an Opened Item.
IOW, it'd belong in "Older bugs affecting stable branches".
Greetings,
Andres Freund
On Wed, Apr 13, 2022 at 8:38 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-04-13 15:28:19 -0500, Justin Pryzby wrote:
On Wed, Mar 30, 2022 at 05:58:24PM +0900, Kyotaro Horiguchi wrote:
I'm not still confident on this, but it should be better than the v1.
+Andres as this seems to be related to 277692220.
FWIW, that'd just mean it's an old bug that wasn't easily noticeable
before, not that it's the fault of 277692220.
I think you're still on the hook to do something about it for this
release. You could decide to revert the commit adding the assertion
and punt on doing thing about the underlying bug, but we can't just be
like "oh, an assertion is failing, we'll get to that someday".
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Apr 13, 2022 at 8:38 PM Andres Freund <andres@anarazel.de> wrote:
FWIW, that'd just mean it's an old bug that wasn't easily noticeable
before, not that it's the fault of 277692220.
I think you're still on the hook to do something about it for this
release.
I think you're trying to shoot the messenger. As Andres says,
277692220 just exposes that there is some pre-existing bug here.
It's probably related to 84f5c2908, so I was planning to take a
look at it at some point, but there are a few other higher-priority
bugs in the way.
I see no point in reverting 277692220. Removing the Assert would
prevent, or at least complicate, detection of other similar bugs.
And it'd do nothing to help end users, who won't have assertions
enabled anyway.
regards, tom lane
On Thu, Apr 14, 2022 at 10:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Apr 13, 2022 at 8:38 PM Andres Freund <andres@anarazel.de> wrote:
FWIW, that'd just mean it's an old bug that wasn't easily noticeable
before, not that it's the fault of 277692220.I think you're still on the hook to do something about it for this
release.I think you're trying to shoot the messenger. As Andres says,
277692220 just exposes that there is some pre-existing bug here.
It's probably related to 84f5c2908, so I was planning to take a
look at it at some point, but there are a few other higher-priority
bugs in the way.
Well, if you're willing to look at it that's fine, but I just don't
agree that it's OK to just commit things that add failing assertions
and drive off into the sunset. The code is always going to have a
certain number of unfixed bugs, and that's fine, and finding them is
good in itself, but people want to be able to run the software in the
meantime, and some of those people are developers or other individuals
who want to run it with assertions enabled. It's a judgement call
whether this assertion failure is going to bite enough people to be a
problem, but if it were something that happened easily enough to
prevent you from working on the source code, I'm sure you wouldn't be
OK with leaving it in there until someone got around to looking at it.
Given that it took about a month and a half for someone to report them
problem, it's not as bad as all that, but I guess I won't be surprised
if we keep getting complaints until something gets done.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-04-14 11:33:45 -0400, Robert Haas wrote:
On Thu, Apr 14, 2022 at 10:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Apr 13, 2022 at 8:38 PM Andres Freund <andres@anarazel.de> wrote:
FWIW, that'd just mean it's an old bug that wasn't easily noticeable
before, not that it's the fault of 277692220.I think you're still on the hook to do something about it for this
release.I think you're trying to shoot the messenger. As Andres says,
277692220 just exposes that there is some pre-existing bug here.
It's probably related to 84f5c2908, so I was planning to take a
look at it at some point, but there are a few other higher-priority
bugs in the way.Well, if you're willing to look at it that's fine, but I just don't
agree that it's OK to just commit things that add failing assertions
and drive off into the sunset.
I'm not planning to ride into the sunset / ignore this issue. All I said
is that it's imo not the right thing to say that that commit broke
things in 15. And that not a semantics game, because it means that the
fix needs to go back further than 277692220.
Greetings,
Andres Freund
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Apr 14, 2022 at 10:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think you're trying to shoot the messenger. As Andres says,
277692220 just exposes that there is some pre-existing bug here.
It's probably related to 84f5c2908, so I was planning to take a
look at it at some point, but there are a few other higher-priority
bugs in the way.
Well, if you're willing to look at it that's fine, but I just don't
agree that it's OK to just commit things that add failing assertions
and drive off into the sunset.
I don't think Andres had any intention of ignoring it indefinitely.
What he is doing is prioritizing it lower than several of the other
open items, an opinion I fully agree with.
regards, tom lane
On Thu, Apr 14, 2022 at 11:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't think Andres had any intention of ignoring it indefinitely.
What he is doing is prioritizing it lower than several of the other
open items, an opinion I fully agree with.
Well, my point is that, historically, relegating things to the older
bugs section often means nothing happens, which I think would not be
great in this case. However, I'll try to shut up about the procedural
issues for the moment, since I seem to be in the minority.
I got curious and looked at the underlying problem here and I am
wondering whether HaveRegisteredOrActiveSnapshot() is just buggy. It
seems to me that the code is always going to return true if there are
any active snapshots, and the rest of the function is intended to test
whether there is a registered snapshot other than the catalog
snapshot. But I don't think that's what this code does:
if (pairingheap_is_empty(&RegisteredSnapshots) ||
!pairingheap_is_singular(&RegisteredSnapshots))
return false;
return CatalogSnapshot == NULL;
So, if there are 0 active snapshots we return false. And if there is 1
active snapshot then we return true if there's no catalog snapshot and
otherwise false. So far so good. But if there are 2 or more registered
snapshots then the pairing heap will be neither empty nor singular so
it seems to me we will return false, which seems to me to be the wrong
answer. I tried this:
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index a0be0c411a..4e8e26a362 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1646,9 +1646,10 @@ HaveRegisteredOrActiveSnapshot(void)
* removed at any time due to invalidation processing. If explicitly
* registered more than one snapshot has to be in RegisteredSnapshots.
*/
- if (pairingheap_is_empty(&RegisteredSnapshots) ||
- !pairingheap_is_singular(&RegisteredSnapshots))
+ if (pairingheap_is_empty(&RegisteredSnapshots))
return false;
+ if (!pairingheap_is_singular(&RegisteredSnapshots))
+ return true;
return CatalogSnapshot == NULL;
}
I find that 'make check-world' passes with this change, which is
disturbing, because it also passes without this change. That means we
don't have any tests that reach HaveRegisteredOrActiveSnapshot() with
more than one registered snapshot.
Also, unless we have plans to use HaveRegisteredOrActiveSnapshot() in
more places, I think we should consider revising the whole approach
here. The way init_toast_snapshot() is coded, we basically have some
code that tests for active and registered snapshots and finds the
oldest one. We error out if there isn't one. And then we get to this
assertion, which checks the same stuff a second time but with an
additional check to see whether we ended up with the catalog snapshot.
Wouldn't it make more sense if GetOldestSnapshot() just refused to
return the catalog snapshot in the first place?
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-04-14 12:16:45 -0400, Robert Haas wrote:
I got curious and looked at the underlying problem here and I am
wondering whether HaveRegisteredOrActiveSnapshot() is just buggy. It
seems to me that the code is always going to return true if there are
any active snapshots, and the rest of the function is intended to test
whether there is a registered snapshot other than the catalog
snapshot. But I don't think that's what this code does:if (pairingheap_is_empty(&RegisteredSnapshots) ||
!pairingheap_is_singular(&RegisteredSnapshots))
return false;return CatalogSnapshot == NULL;
Certainly looks off...
I find that 'make check-world' passes with this change, which is
disturbing, because it also passes without this change. That means we
don't have any tests that reach HaveRegisteredOrActiveSnapshot() with
more than one registered snapshot.
Part of that is because right now the assertion is placed "too deep" -
it should be much higher up, so it's reached even if there's not
actually a toast datum. But there's of other bugs preventing that :(. A
lot of bugs have been hidden by the existence of CatalogSnapshot (which
of course isn't something one actually can rely on).
Also, unless we have plans to use HaveRegisteredOrActiveSnapshot() in
more places,
I think we should, but there's the other bugs that need to be fixed
first :(. Namely that we have plenty places doing catalog accesses
without an active or registered snapshot :(.
I think we should consider revising the whole approach here. The way
init_toast_snapshot() is coded, we basically have some code that tests
for active and registered snapshots and finds the oldest one. We error
out if there isn't one. And then we get to this assertion, which
checks the same stuff a second time but with an additional check to
see whether we ended up with the catalog snapshot. Wouldn't it make
more sense if GetOldestSnapshot() just refused to return the catalog
snapshot in the first place?
I'm worried that that could cause additional bugs. Consider code using
GetOldestSnapshot() to check if tuples need to be preserved or such.
Greetings,
Andres Freund
Hi,
On 2022-04-14 09:54:25 -0700, Andres Freund wrote:
On 2022-04-14 12:16:45 -0400, Robert Haas wrote:
I got curious and looked at the underlying problem here and I am
wondering whether HaveRegisteredOrActiveSnapshot() is just buggy. It
seems to me that the code is always going to return true if there are
any active snapshots, and the rest of the function is intended to test
whether there is a registered snapshot other than the catalog
snapshot. But I don't think that's what this code does:if (pairingheap_is_empty(&RegisteredSnapshots) ||
!pairingheap_is_singular(&RegisteredSnapshots))
return false;return CatalogSnapshot == NULL;
Certainly looks off...
I find that 'make check-world' passes with this change, which is
disturbing, because it also passes without this change. That means we
don't have any tests that reach HaveRegisteredOrActiveSnapshot() with
more than one registered snapshot.Part of that is because right now the assertion is placed "too deep" -
it should be much higher up, so it's reached even if there's not
actually a toast datum. But there's of other bugs preventing that :(. A
lot of bugs have been hidden by the existence of CatalogSnapshot (which
of course isn't something one actually can rely on).Also, unless we have plans to use HaveRegisteredOrActiveSnapshot() in
more places,I think we should, but there's the other bugs that need to be fixed
first :(. Namely that we have plenty places doing catalog accesses
without an active or registered snapshot :(.
Ah, we actually were debating some of these issues more recently, in:
/messages/by-id/20220311030721.olixpzcquqkw2qyt@alap3.anarazel.de
/messages/by-id/20220311021047.hgtqkrl6n52srvdu@alap3.anarazel.de
It looks like the same bug, and that the patch in this thread fixes
them. And that we need to backpatch the fix, right?
Greetings,
Andres Freund