TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Started by Erik Rijkersalmost 4 years ago46 messages
#1Erik Rijkers
er@xs4all.nl
1 attachment(s)

Hi,

On master I got a FailedAssertion("HaveRegisteredOrActiveSnapshot()"
on an assert-enabled instance and with (I think) data over a certain length.

I whittled it down to the attached bash (careful - it drops stuff). It
has 5 tsv-data lines (one long line) that COPY slurps into a table. The
middle, third line causes the problem, later on. Shortening the long
line to somewhere below 2000 characters fixes it again.

More info in the attached .sh file.

If debug-assert is 'off', the problem does not occur. (REL_14_STABLE
also does not have the problem, assertions or not)

thanks,

Erik Rijkers

Attachments:

bugsnapshot.shapplication/x-shellscript; name=bugsnapshot.shDownload
#2Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Erik Rijkers (#1)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

At Sun, 27 Mar 2022 20:32:45 +0200, Erik Rijkers <er@xs4all.nl> wrote in

On master I got a FailedAssertion("HaveRegisteredOrActiveSnapshot()"
on an assert-enabled instance and with (I think) data over a certain
length.

I whittled it down to the attached bash (careful - it drops stuff).
It has 5 tsv-data lines (one long line) that COPY slurps into a table.
The middle, third line causes the problem, later on. Shortening the
long line to somewhere below 2000 characters fixes it again.

More info in the attached .sh file.

It is reproducible for me. Thanks for the reproducer.

If debug-assert is 'off', the problem does not occur. (REL_14_STABLE
also does not have the problem, assertions or not)

It seems like related with [1]/messages/by-id/20210623035916.GL29179@telsasoft.com?

Inserting EnsurePortalSnapshotExists() to RunFromStore fixes this but
I'm not sure where is the right place to do this..

[1]: /messages/by-id/20210623035916.GL29179@telsasoft.com

For someone's information, this is more readable stack trace.

#0 0x00007f43aeed037f in raise () from /lib64/libc.so.6
#1 0x00007f43aeebadb5 in abort () from /lib64/libc.so.6
#2 0x0000000000b28747 in ExceptionalCondition (
conditionName=0xba2c48 "HaveRegisteredOrActiveSnapshot()",
errorType=0xba2882 "FailedAssertion",
fileName=0xba2870 "toast_internals.c", lineNumber=670) at assert.c:69
#3 0x00000000004ac776 in init_toast_snapshot (toast_snapshot=0x7ffce64f7440)
at toast_internals.c:670
#4 0x00000000005164ea in heap_fetch_toast_slice (toastrel=0x7f43b193cad0,
valueid=16393, attrsize=1848, sliceoffset=0, slicelength=1848,
result=0x1cbb948) at heaptoast.c:688
#5 0x000000000049fc86 in table_relation_fetch_toast_slice (
toastrel=0x7f43b193cad0, valueid=16393, attrsize=1848, sliceoffset=0,
slicelength=1848, result=0x1cbb948)
at ../../../../src/include/access/tableam.h:1892
#6 0x00000000004a0a0f in toast_fetch_datum (attr=0x1d6b171) at detoast.c:375
#7 0x000000000049fffb in detoast_attr (attr=0x1d6b171) at detoast.c:123
#8 0x0000000000b345ba in pg_detoast_datum_packed (datum=0x1d6b171)
at fmgr.c:1757
#9 0x0000000000aece72 in text_to_cstring (t=0x1d6b171) at varlena.c:225
#10 0x0000000000aedda2 in textout (fcinfo=0x7ffce64f77a0) at varlena.c:574
#11 0x0000000000b331bf in FunctionCall1Coll (flinfo=0x1d695e0, collation=0,
arg1=30847345) at fmgr.c:1138
#12 0x0000000000b3422b in OutputFunctionCall (flinfo=0x1d695e0, val=30847345)
at fmgr.c:1575
#13 0x00000000004a6b6c in printtup (slot=0x1cb81f0, self=0x1c96e90)
at printtup.c:357
#14 0x000000000099499f in RunFromStore (portal=0x1cf9380,
direction=ForwardScanDirection, count=0, dest=0x1c96e90) at pquery.c:1096
#15 0x00000000009944e3 in PortalRunSelect (portal=0x1cf9380, forward=true,
count=0, dest=0x1c96e90) at pquery.c:917
#16 0x00000000009941d3 in PortalRun (portal=0x1cf9380,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x1c96e90, altdest=0x1c96e90, qc=0x7ffce64f7ac0) at pquery.c:765
#17 0x000000000098df4b in exec_simple_query (
query_string=0x1c96030 "fetch all in myportal;") at postgres.c:1250
#18 0x00000000009923a3 in PostgresMain (dbname=0x1cc11b0 "postgres",
username=0x1cc1188 "horiguti") at postgres.c:4520
#19 0x00000000008c6caf in BackendRun (port=0x1cb74c0) at postmaster.c:4593
#20 0x00000000008c6631 in BackendStartup (port=0x1cb74c0) at postmaster.c:4321
#21 0x00000000008c29cb in ServerLoop () at postmaster.c:1801
#22 0x00000000008c2298 in PostmasterMain (argc=1, argv=0x1c8e0e0)
at postmaster.c:1473
#23 0x00000000007c14c3 in main (argc=1, argv=0x1c8e0e0) at main.c:202

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#2)
1 attachment(s)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

At Mon, 28 Mar 2022 18:36:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

Inserting EnsurePortalSnapshotExists() to RunFromStore fixes this but
I'm not sure where is the right place to do this..

Then, I found that portal->holdSnapshot is that. I came up with the
attached. It does the follows:

1. Teach PlannedStmtRequiresSnapshot() to return true for FetchStmt.

2. Use holdSnapshot in RunFromStore if any.

The rerpducer is reduced to as small as the following.

CREATE TABLE t (a text);
INSERT INTO t VALUES('some random text');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;

But I haven't come up with a reasonable way to generate the 'some
random text' yet.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

use_holdsnapshot_to_read_cursor.patchtext/x-patch; charset=us-asciiDownload
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 5aa5a350f3..8354029f2a 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1068,6 +1068,18 @@ RunFromStore(Portal portal, ScanDirection direction, uint64 count,
 
 	dest->rStartup(dest, CMD_SELECT, portal->tupDesc);
 
+	/*
+	 * If holdSnapshot is set, that means we should use the snapshot to read
+	 * this store.
+	 */
+	if (portal->holdSnapshot)
+	{
+		Assert(portal->portalSnapshot == 0);
+		PushActiveSnapshotWithLevel(portal->holdSnapshot,
+									portal->createLevel);
+		portal->portalSnapshot = GetActiveSnapshot();
+	}
+	
 	if (ScanDirectionIsNoMovement(direction))
 	{
 		/* do nothing except start/stop the destination */
@@ -1114,6 +1126,13 @@ RunFromStore(Portal portal, ScanDirection direction, uint64 count,
 
 	dest->rShutdown(dest);
 
+	if (portal->holdSnapshot)
+	{
+		Assert(portal->portalSnapshot == GetActiveSnapshot());
+		PopActiveSnapshot();
+		portal->portalSnapshot = NULL;
+	}
+
 	ExecDropSingleTupleTableSlot(slot);
 
 	return current_tuple_count;
@@ -1756,7 +1775,6 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, VariableShowStmt) ||
 		IsA(utilityStmt, ConstraintsSetStmt) ||
 	/* efficiency hacks from here down */
-		IsA(utilityStmt, FetchStmt) ||
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#3)
1 attachment(s)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

At Tue, 29 Mar 2022 17:06:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Mon, 28 Mar 2022 18:36:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Then, I found that portal->holdSnapshot is that. I came up with the
attached. It does the follows:

1. Teach PlannedStmtRequiresSnapshot() to return true for FetchStmt.

2. Use holdSnapshot in RunFromStore if any.

The rerpducer is reduced to as small as the following.

CREATE TABLE t (a text);
INSERT INTO t VALUES('some random text');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;

But I haven't come up with a reasonable way to generate the 'some
random text' yet.

I gave up and took a straightforward way to generate one.

I don't like that it uses a fixed length for the random text, but
anyway it works for now...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v1-0001-Appropriately-set-snapshot-on-cursor-fetch.patchtext/x-patch; charset=us-asciiDownload
From 669e72feb29262110fb7085601d1042b781d3a2b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 29 Mar 2022 17:53:39 +0900
Subject: [PATCH v1] Appropriately set snapshot on cursor fetch

84f5c2908d thought that FETCH doesn't need a snapshot, but it is
actually needed.  FETCH can crash when accessing a toast value.  To
fix this, teach PlannedStmtRequiresSnapshot that FETCH needs a
snapshot then tell RunFromStore to use the snapshot.

Reported-by: Erik Rijkers <er@xs4all.nl>
---
 src/backend/tcop/pquery.c             | 17 ++++++++++++++++-
 src/test/regress/expected/portals.out | 20 ++++++++++++++++++++
 src/test/regress/sql/portals.sql      | 22 ++++++++++++++++++++++
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 5aa5a350f3..08b37fb63d 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1068,6 +1068,15 @@ RunFromStore(Portal portal, ScanDirection direction, uint64 count,
 
 	dest->rStartup(dest, CMD_SELECT, portal->tupDesc);
 
+	/* Valid holdSnapshot here means it should be used to read this store. */
+	if (portal->holdSnapshot)
+	{
+		Assert(portal->portalSnapshot == 0);
+		PushActiveSnapshotWithLevel(portal->holdSnapshot,
+									portal->createLevel);
+		portal->portalSnapshot = GetActiveSnapshot();
+	}
+	
 	if (ScanDirectionIsNoMovement(direction))
 	{
 		/* do nothing except start/stop the destination */
@@ -1114,6 +1123,13 @@ RunFromStore(Portal portal, ScanDirection direction, uint64 count,
 
 	dest->rShutdown(dest);
 
+	if (portal->holdSnapshot)
+	{
+		Assert(portal->portalSnapshot == GetActiveSnapshot());
+		PopActiveSnapshot();
+		portal->portalSnapshot = NULL;
+	}
+
 	ExecDropSingleTupleTableSlot(slot);
 
 	return current_tuple_count;
@@ -1756,7 +1772,6 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, VariableShowStmt) ||
 		IsA(utilityStmt, ConstraintsSetStmt) ||
 	/* efficiency hacks from here down */
-		IsA(utilityStmt, FetchStmt) ||
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
diff --git a/src/test/regress/expected/portals.out b/src/test/regress/expected/portals.out
index 9da74876e1..fd56859370 100644
--- a/src/test/regress/expected/portals.out
+++ b/src/test/regress/expected/portals.out
@@ -1536,3 +1536,23 @@ fetch backward all in c2;
 (3 rows)
 
 rollback;
+-- Check if snapshot is available for cursor while accessing toast value
+begin;
+create function r (l int) returns text as $$
+DECLARE
+   i int;
+   s text;
+BEGIN
+   s := '';
+   for i in 0..l loop
+     s := s || chr((random() * 254)::int + 1);
+   end loop;
+   return s;
+END;
+$$ language plpgsql;
+create table t(a text);
+insert into t values(r(4192));
+declare c cursor for select a from t;
+-- the following shouldn't crash
+fetch all in c \gset result
+rollback;
diff --git a/src/test/regress/sql/portals.sql b/src/test/regress/sql/portals.sql
index eadf6ed942..0dfe0ad981 100644
--- a/src/test/regress/sql/portals.sql
+++ b/src/test/regress/sql/portals.sql
@@ -581,3 +581,25 @@ declare c2 scroll cursor for select generate_series(1,3) as g;
 fetch all in c2;
 fetch backward all in c2;
 rollback;
+
+-- Check if snapshot is available for cursor while accessing toast value
+begin;
+create function r (l int) returns text as $$
+DECLARE
+   i int;
+   s text;
+BEGIN
+   s := '';
+   for i in 0..l loop
+     s := s || chr((random() * 254)::int + 1);
+   end loop;
+   return s;
+END;
+$$ language plpgsql;
+
+create table t(a text);
+insert into t values(r(4192));
+declare c cursor for select a from t;
+-- the following shouldn't crash
+fetch all in c \gset result
+rollback;
-- 
2.27.0

#5Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Kyotaro Horiguchi (#4)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Tue, 29 Mar 2022 at 11:10, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Tue, 29 Mar 2022 17:06:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Mon, 28 Mar 2022 18:36:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Then, I found that portal->holdSnapshot is that. I came up with the
attached. It does the follows:

1. Teach PlannedStmtRequiresSnapshot() to return true for FetchStmt.

2. Use holdSnapshot in RunFromStore if any.

The rerpducer is reduced to as small as the following.

CREATE TABLE t (a text);
INSERT INTO t VALUES('some random text');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;

But I haven't come up with a reasonable way to generate the 'some
random text' yet.

I gave up and took a straightforward way to generate one.

I don't like that it uses a fixed length for the random text, but
anyway it works for now...

An shorter (?) reproducer might be the following, which forces any
value for 'a' to be toasted and thus triggering the check in
init_toast_snapshot regardless of value length:

CREATE TABLE t (a text);
ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL;
INSERT INTO t VALUES ('toast');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;

Enjoy,

-Matthias

#6Erik Rijkers
er@xs4all.nl
In reply to: Matthias van de Meent (#5)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Op 29-03-2022 om 12:50 schreef Matthias van de Meent:

On Tue, 29 Mar 2022 at 11:10, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Tue, 29 Mar 2022 17:06:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Mon, 28 Mar 2022 18:36:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Then, I found that portal->holdSnapshot is that. I came up with the
attached. It does the follows:

1. Teach PlannedStmtRequiresSnapshot() to return true for FetchStmt.

2. Use holdSnapshot in RunFromStore if any.

The rerpducer is reduced to as small as the following.

CREATE TABLE t (a text);
INSERT INTO t VALUES('some random text');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;

But I haven't come up with a reasonable way to generate the 'some
random text' yet.

I gave up and took a straightforward way to generate one.

I don't like that it uses a fixed length for the random text, but
anyway it works for now...

An shorter (?) reproducer might be the following, which forces any
value for 'a' to be toasted and thus triggering the check in
init_toast_snapshot regardless of value length:

CREATE TABLE t (a text);
ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL;
INSERT INTO t VALUES ('toast');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;

Excellent. That indeed immediately forces the error.

(and the patch prevents it)

Thanks!

Show quoted text

Enjoy,

-Matthias

#7Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Erik Rijkers (#6)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

At Tue, 29 Mar 2022 13:29:15 +0200, Erik Rijkers <er@xs4all.nl> wrote in

Op 29-03-2022 om 12:50 schreef Matthias van de Meent:

An shorter (?) reproducer might be the following, which forces any
value for 'a' to be toasted and thus triggering the check in
init_toast_snapshot regardless of value length:
CREATE TABLE t (a text);
ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL;
INSERT INTO t VALUES ('toast');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;

Yeah, unfortunately I tried that first and saw it didn't work. And it
still doesn't for me. With such a short text pg_detoast_datum_pakced
doesn't call detoast_attr. Actually it is VARATT_IS_1B. (@master)

I think I'm missing something here. I'm going to examine around.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#7)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

At Wed, 30 Mar 2022 10:06:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Tue, 29 Mar 2022 13:29:15 +0200, Erik Rijkers <er@xs4all.nl> wrote in

Op 29-03-2022 om 12:50 schreef Matthias van de Meent:

An shorter (?) reproducer might be the following, which forces any
value for 'a' to be toasted and thus triggering the check in
init_toast_snapshot regardless of value length:
CREATE TABLE t (a text);
ALTER TABLE t ALTER COLUMN a SET STORAGE EXTERNAL;
INSERT INTO t VALUES ('toast');
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH ALL IN c;

Yeah, unfortunately I tried that first and saw it didn't work. And it
still doesn't for me. With such a short text pg_detoast_datum_pakced
doesn't call detoast_attr. Actually it is VARATT_IS_1B. (@master)

I think I'm missing something here. I'm going to examine around.

Hmm. Strange. My memory tells that I did the same thing before.. I
thought that it is somewhat related to compression since repeat('x',
4096) didin't seem working at that time, but it worked this time.
Maybe I was confused between extended and external..

But, in the first place the *fix* has been found to be wrong. I'm
going to search for the right fix..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#9Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#8)
1 attachment(s)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

At Wed, 30 Mar 2022 11:46:13 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

But, in the first place the *fix* has been found to be wrong. I'm
going to search for the right fix..

FETCH uses the snapshot at DECLARE. So anyhow I needed to set the
queryDesk's snapshot used in PortalRunSelect to the FETCH's portal's
holdSnapshot. What I did in this version is:

1. Add a new member "snapshot" to the type DestReceiver.

2. In PortalRunSelect, set the DECLARE'd query's snapshot to the
member iff the dest is tupelstore and the active snapshot is not
set.

3. In FillPortalStore, copy the snapshot to the portal's holdSnapshot.

4. RunFromStore uses holdSnapshot if any.

I'm not still confident on this, but it should be better than the v1.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v2-0001-Appropriately-set-snapshot-on-cursor-fetch.patchtext/x-patch; charset=us-asciiDownload
From 8d40409d8c8e80e53ff4a06eb9ac99c5f30fb427 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 29 Mar 2022 17:53:39 +0900
Subject: [PATCH v2] Appropriately set snapshot on cursor fetch

FETCH runs with the registered snapshot at DECLARE.  So if the result
contained TOASTed values, the same snapshot is required.  This patch
make PortalRunSelect to pass the DECLARE'd query's snapshot to outer
portal so that RunFromStore can use the snapshot.

Reported-by: Erik Rijkers <er@xs4all.nl>
---
 src/backend/tcop/dest.c               |  8 ++--
 src/backend/tcop/pquery.c             | 65 ++++++++++++++++++++++++++-
 src/include/tcop/dest.h               |  4 +-
 src/test/regress/expected/portals.out |  9 ++++
 src/test/regress/sql/portals.sql      | 10 +++++
 5 files changed, 90 insertions(+), 6 deletions(-)

diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index c952cbea8b..4a8a67323e 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -69,22 +69,22 @@ donothingCleanup(DestReceiver *self)
  */
 static const DestReceiver donothingDR = {
 	donothingReceive, donothingStartup, donothingCleanup, donothingCleanup,
-	DestNone
+	NULL, DestNone
 };
 
 static const DestReceiver debugtupDR = {
 	debugtup, debugStartup, donothingCleanup, donothingCleanup,
-	DestDebug
+	NULL, DestDebug
 };
 
 static const DestReceiver printsimpleDR = {
 	printsimple, printsimple_startup, donothingCleanup, donothingCleanup,
-	DestRemoteSimple
+	NULL, DestRemoteSimple
 };
 
 static const DestReceiver spi_printtupDR = {
 	spi_printtup, spi_dest_startup, donothingCleanup, donothingCleanup,
-	DestSPI
+	NULL, DestSPI
 };
 
 /*
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 5aa5a350f3..001ff30db3 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -27,7 +27,7 @@
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
 
-
+extern const char *resownname(ResourceOwner o);
 /*
  * ActivePortal is the currently executing Portal (the most closely nested,
  * if there are several).
@@ -925,6 +925,26 @@ PortalRunSelect(Portal portal,
 						portal->run_once);
 			nprocessed = queryDesc->estate->es_processed;
 			PopActiveSnapshot();
+
+			/*
+			 * When the destination is DestTuplestore, the query snapshot is
+			 * required to read the result.  Since the snapshot is used as
+			 * holdSnapshot of the upper portal, its lifetime must be equal to
+			 * or longer than the life of the upper portal. If the queryDesc
+			 * comes from GetPortalByName, the lifetime of the snapshot should
+			 * be longer. On the other hand if the query comes from
+			 * GetCachedPlan, the snapshot would be freed before the upper
+			 * portal fetches the stored tuples. In that case, PortalRunUtility
+			 * should have set the active snapshot and everything goes fine
+			 * without setting the tuplestore's snapshot here.
+			 */
+			if (dest->mydest == DestTuplestore)
+			{
+				if (!ActiveSnapshotSet())
+					dest->snapshot = queryDesc->snapshot;
+				else
+					Assert(queryDesc->snapshot == GetActiveSnapshot());
+			}
 		}
 
 		if (!ScanDirectionIsNoMovement(direction))
@@ -965,6 +985,18 @@ PortalRunSelect(Portal portal,
 						portal->run_once);
 			nprocessed = queryDesc->estate->es_processed;
 			PopActiveSnapshot();
+
+			/*
+			 * Set query snapshot for the upper portal. See the comment above
+			 * for the details.
+			 */
+			if (dest->mydest == DestTuplestore)
+			{
+				if (!ActiveSnapshotSet())
+					dest->snapshot = queryDesc->snapshot;
+				else
+					Assert(queryDesc->snapshot == GetActiveSnapshot());
+			}
 		}
 
 		if (!ScanDirectionIsNoMovement(direction))
@@ -1038,6 +1070,16 @@ FillPortalStore(Portal portal, bool isTopLevel)
 			break;
 	}
 
+	/*
+	 * Set holdSnapshot if passed from the command. It should be set only if
+	 * holdSnapshot is not set yet.
+	 */
+	if (treceiver->snapshot)
+	{
+		Assert(portal->holdSnapshot == NULL);
+		portal->holdSnapshot = RegisterSnapshot(treceiver->snapshot);
+	}
+
 	/* Override portal completion data with actual command results */
 	if (qc.commandTag != CMDTAG_UNKNOWN)
 		CopyQueryCompletion(&portal->qc, &qc);
@@ -1063,9 +1105,22 @@ RunFromStore(Portal portal, ScanDirection direction, uint64 count,
 {
 	uint64		current_tuple_count = 0;
 	TupleTableSlot *slot;
+	bool		do_free_snapshot = false;
 
 	slot = MakeSingleTupleTableSlot(portal->tupDesc, &TTSOpsMinimalTuple);
 
+	/*
+	 * There's a case where the filler of the holdStore may set holdSnapshot
+	 * even if our caller didn't set one. Use that snapshot in that case.
+	 */
+	if (portal->portalSnapshot == NULL && portal->holdSnapshot)
+	{
+		PushActiveSnapshotWithLevel(portal->holdSnapshot,
+									portal->createLevel);
+		portal->portalSnapshot = GetActiveSnapshot();
+		do_free_snapshot = true;
+	}
+
 	dest->rStartup(dest, CMD_SELECT, portal->tupDesc);
 
 	if (ScanDirectionIsNoMovement(direction))
@@ -1114,6 +1169,14 @@ RunFromStore(Portal portal, ScanDirection direction, uint64 count,
 
 	dest->rShutdown(dest);
 
+	/* Delete the active snapshot if it is set in this function. */
+	if (do_free_snapshot)
+	{
+		Assert(portal->portalSnapshot == GetActiveSnapshot());
+		PopActiveSnapshot();
+		portal->portalSnapshot = NULL;
+	}
+
 	ExecDropSingleTupleTableSlot(slot);
 
 	return current_tuple_count;
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index 3c3eabae67..b90dbcc2f9 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -69,7 +69,7 @@
 
 #include "executor/tuptable.h"
 #include "tcop/cmdtag.h"
-
+#include "utils/snapshot.h"
 
 /* buffer size to use for command completion tags */
 #define COMPLETION_TAG_BUFSIZE	64
@@ -125,6 +125,8 @@ struct _DestReceiver
 	void		(*rShutdown) (DestReceiver *self);
 	/* Destroy the receiver object itself (if dynamically allocated) */
 	void		(*rDestroy) (DestReceiver *self);
+	/* tuplestore only: snapshot to use when reading the tuplestore */
+	Snapshot	snapshot;
 	/* CommandDest code for this receiver */
 	CommandDest mydest;
 	/* Private fields might appear beyond this point... */
diff --git a/src/test/regress/expected/portals.out b/src/test/regress/expected/portals.out
index 9da74876e1..98a7b04b4d 100644
--- a/src/test/regress/expected/portals.out
+++ b/src/test/regress/expected/portals.out
@@ -1536,3 +1536,12 @@ fetch backward all in c2;
 (3 rows)
 
 rollback;
+-- Check if snapshot is available for cursor while accessing toast value
+begin;
+create table t(a text);
+alter table t alter column a set storage external;
+insert into t values(repeat('x', 4096));
+declare c cursor for select a from t;
+-- the following shouldn't crash
+fetch all in c \gset result
+rollback;
diff --git a/src/test/regress/sql/portals.sql b/src/test/regress/sql/portals.sql
index eadf6ed942..545fad1e04 100644
--- a/src/test/regress/sql/portals.sql
+++ b/src/test/regress/sql/portals.sql
@@ -581,3 +581,13 @@ declare c2 scroll cursor for select generate_series(1,3) as g;
 fetch all in c2;
 fetch backward all in c2;
 rollback;
+
+-- Check if snapshot is available for cursor while accessing toast value
+begin;
+create table t(a text);
+alter table t alter column a set storage external;
+insert into t values(repeat('x', 4096));
+declare c cursor for select a from t;
+-- the following shouldn't crash
+fetch all in c \gset result
+rollback;
-- 
2.27.0

#10Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#9)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

At Wed, 30 Mar 2022 17:58:24 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Wed, 30 Mar 2022 11:46:13 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

But, in the first place the *fix* has been found to be wrong. I'm
going to search for the right fix..

FETCH uses the snapshot at DECLARE. So anyhow I needed to set the
queryDesk's snapshot used in PortalRunSelect to the FETCH's portal's
holdSnapshot. What I did in this version is:

By the way, this is, given that the added check in init_toast_snapshot
is valid, a long-standing "bug", which at least back to 12.

I'm not sure what to do for this.

1. ignore the check for certain cases?

2. apply any fix only to master and call it a day. 14 and earlier
doesn't have the assertion check so they don't complain.

3. apply a fix to all affected versions.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#11Justin Pryzby
pryzby@telsasoft.com
In reply to: Kyotaro Horiguchi (#9)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Wed, Mar 30, 2022 at 05:58:24PM +0900, Kyotaro Horiguchi wrote:

I'm not still confident on this, but it should be better than the v1.

+Andres as this seems to be related to 277692220.
I added it as an Opened Item.

#12Andres Freund
andres@anarazel.de
In reply to: Justin Pryzby (#11)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Hi,

On 2022-04-13 15:28:19 -0500, Justin Pryzby wrote:

On Wed, Mar 30, 2022 at 05:58:24PM +0900, Kyotaro Horiguchi wrote:

I'm not still confident on this, but it should be better than the v1.

+Andres as this seems to be related to 277692220.

FWIW, that'd just mean it's an old bug that wasn't easily noticeable
before, not that it's the fault of 277692220.

I added it as an Opened Item.

IOW, it'd belong in "Older bugs affecting stable branches".

Greetings,

Andres Freund

#13Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#12)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Wed, Apr 13, 2022 at 8:38 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-04-13 15:28:19 -0500, Justin Pryzby wrote:

On Wed, Mar 30, 2022 at 05:58:24PM +0900, Kyotaro Horiguchi wrote:

I'm not still confident on this, but it should be better than the v1.

+Andres as this seems to be related to 277692220.

FWIW, that'd just mean it's an old bug that wasn't easily noticeable
before, not that it's the fault of 277692220.

I think you're still on the hook to do something about it for this
release. You could decide to revert the commit adding the assertion
and punt on doing thing about the underlying bug, but we can't just be
like "oh, an assertion is failing, we'll get to that someday".

--
Robert Haas
EDB: http://www.enterprisedb.com

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Apr 13, 2022 at 8:38 PM Andres Freund <andres@anarazel.de> wrote:

FWIW, that'd just mean it's an old bug that wasn't easily noticeable
before, not that it's the fault of 277692220.

I think you're still on the hook to do something about it for this
release.

I think you're trying to shoot the messenger. As Andres says,
277692220 just exposes that there is some pre-existing bug here.
It's probably related to 84f5c2908, so I was planning to take a
look at it at some point, but there are a few other higher-priority
bugs in the way.

I see no point in reverting 277692220. Removing the Assert would
prevent, or at least complicate, detection of other similar bugs.
And it'd do nothing to help end users, who won't have assertions
enabled anyway.

regards, tom lane

#15Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#14)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Thu, Apr 14, 2022 at 10:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Apr 13, 2022 at 8:38 PM Andres Freund <andres@anarazel.de> wrote:

FWIW, that'd just mean it's an old bug that wasn't easily noticeable
before, not that it's the fault of 277692220.

I think you're still on the hook to do something about it for this
release.

I think you're trying to shoot the messenger. As Andres says,
277692220 just exposes that there is some pre-existing bug here.
It's probably related to 84f5c2908, so I was planning to take a
look at it at some point, but there are a few other higher-priority
bugs in the way.

Well, if you're willing to look at it that's fine, but I just don't
agree that it's OK to just commit things that add failing assertions
and drive off into the sunset. The code is always going to have a
certain number of unfixed bugs, and that's fine, and finding them is
good in itself, but people want to be able to run the software in the
meantime, and some of those people are developers or other individuals
who want to run it with assertions enabled. It's a judgement call
whether this assertion failure is going to bite enough people to be a
problem, but if it were something that happened easily enough to
prevent you from working on the source code, I'm sure you wouldn't be
OK with leaving it in there until someone got around to looking at it.
Given that it took about a month and a half for someone to report them
problem, it's not as bad as all that, but I guess I won't be surprised
if we keep getting complaints until something gets done.

--
Robert Haas
EDB: http://www.enterprisedb.com

#16Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#15)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Hi,

On 2022-04-14 11:33:45 -0400, Robert Haas wrote:

On Thu, Apr 14, 2022 at 10:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Apr 13, 2022 at 8:38 PM Andres Freund <andres@anarazel.de> wrote:

FWIW, that'd just mean it's an old bug that wasn't easily noticeable
before, not that it's the fault of 277692220.

I think you're still on the hook to do something about it for this
release.

I think you're trying to shoot the messenger. As Andres says,
277692220 just exposes that there is some pre-existing bug here.
It's probably related to 84f5c2908, so I was planning to take a
look at it at some point, but there are a few other higher-priority
bugs in the way.

Well, if you're willing to look at it that's fine, but I just don't
agree that it's OK to just commit things that add failing assertions
and drive off into the sunset.

I'm not planning to ride into the sunset / ignore this issue. All I said
is that it's imo not the right thing to say that that commit broke
things in 15. And that not a semantics game, because it means that the
fix needs to go back further than 277692220.

Greetings,

Andres Freund

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#15)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Apr 14, 2022 at 10:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think you're trying to shoot the messenger. As Andres says,
277692220 just exposes that there is some pre-existing bug here.
It's probably related to 84f5c2908, so I was planning to take a
look at it at some point, but there are a few other higher-priority
bugs in the way.

Well, if you're willing to look at it that's fine, but I just don't
agree that it's OK to just commit things that add failing assertions
and drive off into the sunset.

I don't think Andres had any intention of ignoring it indefinitely.
What he is doing is prioritizing it lower than several of the other
open items, an opinion I fully agree with.

regards, tom lane

#18Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#17)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Thu, Apr 14, 2022 at 11:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I don't think Andres had any intention of ignoring it indefinitely.
What he is doing is prioritizing it lower than several of the other
open items, an opinion I fully agree with.

Well, my point is that, historically, relegating things to the older
bugs section often means nothing happens, which I think would not be
great in this case. However, I'll try to shut up about the procedural
issues for the moment, since I seem to be in the minority.

I got curious and looked at the underlying problem here and I am
wondering whether HaveRegisteredOrActiveSnapshot() is just buggy. It
seems to me that the code is always going to return true if there are
any active snapshots, and the rest of the function is intended to test
whether there is a registered snapshot other than the catalog
snapshot. But I don't think that's what this code does:

if (pairingheap_is_empty(&RegisteredSnapshots) ||
!pairingheap_is_singular(&RegisteredSnapshots))
return false;

return CatalogSnapshot == NULL;

So, if there are 0 active snapshots we return false. And if there is 1
active snapshot then we return true if there's no catalog snapshot and
otherwise false. So far so good. But if there are 2 or more registered
snapshots then the pairing heap will be neither empty nor singular so
it seems to me we will return false, which seems to me to be the wrong
answer. I tried this:

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index a0be0c411a..4e8e26a362 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1646,9 +1646,10 @@ HaveRegisteredOrActiveSnapshot(void)
      * removed at any time due to invalidation processing. If explicitly
      * registered more than one snapshot has to be in RegisteredSnapshots.
      */
-    if (pairingheap_is_empty(&RegisteredSnapshots) ||
-        !pairingheap_is_singular(&RegisteredSnapshots))
+    if (pairingheap_is_empty(&RegisteredSnapshots))
         return false;
+    if (!pairingheap_is_singular(&RegisteredSnapshots))
+        return true;

return CatalogSnapshot == NULL;
}

I find that 'make check-world' passes with this change, which is
disturbing, because it also passes without this change. That means we
don't have any tests that reach HaveRegisteredOrActiveSnapshot() with
more than one registered snapshot.

Also, unless we have plans to use HaveRegisteredOrActiveSnapshot() in
more places, I think we should consider revising the whole approach
here. The way init_toast_snapshot() is coded, we basically have some
code that tests for active and registered snapshots and finds the
oldest one. We error out if there isn't one. And then we get to this
assertion, which checks the same stuff a second time but with an
additional check to see whether we ended up with the catalog snapshot.
Wouldn't it make more sense if GetOldestSnapshot() just refused to
return the catalog snapshot in the first place?

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#18)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Hi,

On 2022-04-14 12:16:45 -0400, Robert Haas wrote:

I got curious and looked at the underlying problem here and I am
wondering whether HaveRegisteredOrActiveSnapshot() is just buggy. It
seems to me that the code is always going to return true if there are
any active snapshots, and the rest of the function is intended to test
whether there is a registered snapshot other than the catalog
snapshot. But I don't think that's what this code does:

if (pairingheap_is_empty(&RegisteredSnapshots) ||
!pairingheap_is_singular(&RegisteredSnapshots))
return false;

return CatalogSnapshot == NULL;

Certainly looks off...

I find that 'make check-world' passes with this change, which is
disturbing, because it also passes without this change. That means we
don't have any tests that reach HaveRegisteredOrActiveSnapshot() with
more than one registered snapshot.

Part of that is because right now the assertion is placed "too deep" -
it should be much higher up, so it's reached even if there's not
actually a toast datum. But there's of other bugs preventing that :(. A
lot of bugs have been hidden by the existence of CatalogSnapshot (which
of course isn't something one actually can rely on).

Also, unless we have plans to use HaveRegisteredOrActiveSnapshot() in
more places,

I think we should, but there's the other bugs that need to be fixed
first :(. Namely that we have plenty places doing catalog accesses
without an active or registered snapshot :(.

I think we should consider revising the whole approach here. The way
init_toast_snapshot() is coded, we basically have some code that tests
for active and registered snapshots and finds the oldest one. We error
out if there isn't one. And then we get to this assertion, which
checks the same stuff a second time but with an additional check to
see whether we ended up with the catalog snapshot. Wouldn't it make
more sense if GetOldestSnapshot() just refused to return the catalog
snapshot in the first place?

I'm worried that that could cause additional bugs. Consider code using
GetOldestSnapshot() to check if tuples need to be preserved or such.

Greetings,

Andres Freund

#20Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#19)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Hi,

On 2022-04-14 09:54:25 -0700, Andres Freund wrote:

On 2022-04-14 12:16:45 -0400, Robert Haas wrote:

I got curious and looked at the underlying problem here and I am
wondering whether HaveRegisteredOrActiveSnapshot() is just buggy. It
seems to me that the code is always going to return true if there are
any active snapshots, and the rest of the function is intended to test
whether there is a registered snapshot other than the catalog
snapshot. But I don't think that's what this code does:

if (pairingheap_is_empty(&RegisteredSnapshots) ||
!pairingheap_is_singular(&RegisteredSnapshots))
return false;

return CatalogSnapshot == NULL;

Certainly looks off...

I find that 'make check-world' passes with this change, which is
disturbing, because it also passes without this change. That means we
don't have any tests that reach HaveRegisteredOrActiveSnapshot() with
more than one registered snapshot.

Part of that is because right now the assertion is placed "too deep" -
it should be much higher up, so it's reached even if there's not
actually a toast datum. But there's of other bugs preventing that :(. A
lot of bugs have been hidden by the existence of CatalogSnapshot (which
of course isn't something one actually can rely on).

Also, unless we have plans to use HaveRegisteredOrActiveSnapshot() in
more places,

I think we should, but there's the other bugs that need to be fixed
first :(. Namely that we have plenty places doing catalog accesses
without an active or registered snapshot :(.

Ah, we actually were debating some of these issues more recently, in:
/messages/by-id/20220311030721.olixpzcquqkw2qyt@alap3.anarazel.de
/messages/by-id/20220311021047.hgtqkrl6n52srvdu@alap3.anarazel.de

It looks like the same bug, and that the patch in this thread fixes
them. And that we need to backpatch the fix, right?

Greetings,

Andres Freund

#21Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#19)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Thu, Apr 14, 2022 at 12:54 PM Andres Freund <andres@anarazel.de> wrote:

Part of that is because right now the assertion is placed "too deep" -
it should be much higher up, so it's reached even if there's not
actually a toast datum. But there's of other bugs preventing that :(. A
lot of bugs have been hidden by the existence of CatalogSnapshot (which
of course isn't something one actually can rely on).

I am wondering whether ffaa44cb559db332baeee7d25dedd74a61974203 bet on
the wrong horse. When I invented CatalogSnapshot, I intended for it to
be an ephemeral snapshot that we'd keep for as long as we could safely
do so and then discard it as soon as there's any hint of a problem.
But that commit really takes the opposite approach, trying to keep
CatalogSnapshot valid for a longer period of time instead of just
chucking it.

Also, unless we have plans to use HaveRegisteredOrActiveSnapshot() in
more places,

I think we should, but there's the other bugs that need to be fixed
first :(. Namely that we have plenty places doing catalog accesses
without an active or registered snapshot :(.

Ouch.

I'm worried that that could cause additional bugs. Consider code using
GetOldestSnapshot() to check if tuples need to be preserved or such.

But there is no such code: GetOldestSnapshot() has only one caller.
And I don't think we'd add more just to do something like that,
because we have other mechanisms that are specifically designed for
testing whether tuples are prunable that are better suited to such
tasks. It should really only be used when you don't know which of the
backend's current snapshots ought to be used for something, but are
sure that using the oldest one will be good enough. And in that kind
of situation, it's hard to see why using the catalog snapshot would
ever be the right idea.

--
Robert Haas
EDB: http://www.enterprisedb.com

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#21)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Robert Haas <robertmhaas@gmail.com> writes:

I am wondering whether ffaa44cb559db332baeee7d25dedd74a61974203 bet on
the wrong horse. When I invented CatalogSnapshot, I intended for it to
be an ephemeral snapshot that we'd keep for as long as we could safely
do so and then discard it as soon as there's any hint of a problem.
But that commit really takes the opposite approach, trying to keep
CatalogSnapshot valid for a longer period of time instead of just
chucking it.

Not really following? ISTM the point of ffaa44cb5 was that if we don't
include the CatalogSnapshot in our advertised xmin, then the length
of time we can safely use it is *zero*.

But there is no such code: GetOldestSnapshot() has only one caller.
And I don't think we'd add more just to do something like that,
because we have other mechanisms that are specifically designed for
testing whether tuples are prunable that are better suited to such
tasks. It should really only be used when you don't know which of the
backend's current snapshots ought to be used for something, but are
sure that using the oldest one will be good enough. And in that kind
of situation, it's hard to see why using the catalog snapshot would
ever be the right idea.

What if the reason we need a snapshot is to detoast some toasted item
we read from a catalog with the CatalogSnapshot? There might not be
any other valid snapshot, so I don't think I buy your argument here.

... Unless your argument is that the session should always have an
older non-catalog snapshot, which I'm not sure whether I buy or not.
But if we do believe that, then adding mechanisms to force it to be so
could be an alternative solution. But why is that better than what
we're doing?

regards, tom lane

#23Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#22)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Hi,

On 2022-04-14 13:36:51 -0400, Tom Lane wrote:

What if the reason we need a snapshot is to detoast some toasted item
we read from a catalog with the CatalogSnapshot? There might not be
any other valid snapshot, so I don't think I buy your argument here.

We definitely do have places doing that, but is it ever actually safe?
Part of the catalog access might cause cache invalidations to be
processed, which can invalidate the snapshot (including resetting
MyProc->xmin). Afaics we always would have push or register the
snapshot, either will copy the snapshot, I think?

Greetings,

Andres Freund

#24Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#22)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Thu, Apr 14, 2022 at 1:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I am wondering whether ffaa44cb559db332baeee7d25dedd74a61974203 bet on
the wrong horse. When I invented CatalogSnapshot, I intended for it to
be an ephemeral snapshot that we'd keep for as long as we could safely
do so and then discard it as soon as there's any hint of a problem.
But that commit really takes the opposite approach, trying to keep
CatalogSnapshot valid for a longer period of time instead of just
chucking it.

Not really following? ISTM the point of ffaa44cb5 was that if we don't
include the CatalogSnapshot in our advertised xmin, then the length
of time we can safely use it is *zero*.

No, I don't think so. I'm proposing that you shouldn't be taking a
catalog snapshot unless you already hold some other snapshot, and that
the catalog snapshot should always be newer than whatever that other
snapshot is. If we made that be true, then your catalog snapshot
couldn't ever be the thing holding back xmin -- and then I think it
doesn't need to be registered.

But there is no such code: GetOldestSnapshot() has only one caller.
And I don't think we'd add more just to do something like that,
because we have other mechanisms that are specifically designed for
testing whether tuples are prunable that are better suited to such
tasks. It should really only be used when you don't know which of the
backend's current snapshots ought to be used for something, but are
sure that using the oldest one will be good enough. And in that kind
of situation, it's hard to see why using the catalog snapshot would
ever be the right idea.

What if the reason we need a snapshot is to detoast some toasted item
we read from a catalog with the CatalogSnapshot? There might not be
any other valid snapshot, so I don't think I buy your argument here.

... Unless your argument is that the session should always have an
older non-catalog snapshot, which I'm not sure whether I buy or not.
But if we do believe that, then adding mechanisms to force it to be so
could be an alternative solution. But why is that better than what
we're doing?

That is exactly my argument, but I'm not actually sure whether it is
in fact better. I was responding to Andres's statement that
CatalogSnapshot was hiding a lot of bugs because it makes it look like
we have a snapshot when we don't really. And my point is that
ffaa44cb559db332baeee7d25dedd74a61974203 made such problems a lot more
likely, because before that the snapshot was not in
RegisteredSnapshots, and therefore anybody asking "hey, do we have a
snapshot?" wasn't likely to get confused, because they would only see
the catalog snapshot if they specifically went and looked at
CatalogSnapshot(Set), and if you do that, hopefully you'll think about
the special semantics of that snapshot and write code that works with
them. But with CatalogSnapshot in RegisteredSnapshots, any code that
looks through RegisteredSnapshots has to worry about whether what it's
finding there is actually just the CatalogSnapshot, and if Andres's
statement that we have a lot of bugs here is to be believed, then we
have not done a good job finding and updating all such code. We can
continue down the path of finding and fixing it -- or we can back out
parts of ffaa44cb559db332baeee7d25dedd74a61974203.

Just to be clear, I'm not debating that that commit fixed some real
problems and I think parts of it are really necessary fixes. But to
quote from the commit message:

The CatalogSnapshot was not plugged into SnapshotResetXmin()'s accounting
for whether MyPgXact->xmin could be cleared or advanced. In normal
transactions this was masked by the fact that the transaction snapshot
would be older, but during backend startup and certain utility commands
it was possible to re-use the CatalogSnapshot after MyPgXact->xmin had
been cleared ...

And what I'm suggesting is that *perhaps* we ought to have fixed those
"during backend startup and certain utility commands" by having
SnapshotResetXmin() do InvalidateCatalogSnapshot(), and maybe also
made the code in those commands that's doing catalog lookups hold
acquire and hold a snapshot of its own around the operation. The
alternative you chose, namely, incorporating the xmin into the
backend's xmin computation, I think can also be made to work. I am
just not sure that it's the best approach.

--
Robert Haas
EDB: http://www.enterprisedb.com

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#19)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Andres Freund <andres@anarazel.de> writes:

On 2022-04-14 12:16:45 -0400, Robert Haas wrote:

I got curious and looked at the underlying problem here and I am
wondering whether HaveRegisteredOrActiveSnapshot() is just buggy. It
seems to me that the code is always going to return true if there are
any active snapshots, and the rest of the function is intended to test
whether there is a registered snapshot other than the catalog
snapshot. But I don't think that's what this code does:

if (pairingheap_is_empty(&RegisteredSnapshots) ||
!pairingheap_is_singular(&RegisteredSnapshots))
return false;

return CatalogSnapshot == NULL;

Certainly looks off...

Yeah, this is broken. Whilst waiting around for a build on wrasse's
host, I reproduced the problem shown in this thread, and here's what
I see at the point of the exception:

(gdb) p RegisteredSnapshots
$5 = {ph_compare = 0x9a6000 <xmin_cmp>, ph_arg = 0x0,
ph_root = 0xec3168 <CatalogSnapshotData+72>}
(gdb) p *RegisteredSnapshots.ph_root
$6 = {first_child = 0x2d85d70, next_sibling = 0x0, prev_or_parent = 0x0}
(gdb) p CatalogSnapshotData
$7 = {snapshot_type = SNAPSHOT_MVCC, xmin = 52155, xmax = 52155,
xip = 0x2d855b0, xcnt = 0, subxip = 0x2de9130, subxcnt = 0,
suboverflowed = false, takenDuringRecovery = false, copied = false,
curcid = 0, speculativeToken = 0, vistest = 0x0, active_count = 0,
regd_count = 0, ph_node = {first_child = 0x2d85d70, next_sibling = 0x0,
prev_or_parent = 0x0}, whenTaken = 0, lsn = 0, snapXactCompletionCount = 1}
(gdb) p CatalogSnapshot
$8 = (Snapshot) 0xec3120 <CatalogSnapshotData>
(gdb) p *(Snapshot) (0x2d85d70-72)
$9 = {snapshot_type = SNAPSHOT_MVCC, xmin = 52155, xmax = 52155, xip = 0x0,
xcnt = 0, subxip = 0x0, subxcnt = 0, suboverflowed = false,
takenDuringRecovery = false, copied = true, curcid = 0,
speculativeToken = 0, vistest = 0x0, active_count = 0, regd_count = 2,
ph_node = {first_child = 0x0, next_sibling = 0x0,
prev_or_parent = 0xec3168 <CatalogSnapshotData+72>}, whenTaken = 0,
lsn = 0, snapXactCompletionCount = 0}
(gdb) p ActiveSnapshot
$10 = (ActiveSnapshotElt *) 0x0

So in fact there IS another registered snapshot, and
HaveRegisteredOrActiveSnapshot is just lying. I think the
correct test is more nearly what we have in
InvalidateCatalogSnapshotConditionally:

if (CatalogSnapshot &&
ActiveSnapshot == NULL &&
pairingheap_is_singular(&RegisteredSnapshots))
// then the CatalogSnapshot is the only one.

Ergo, this actually is a bug in 277692220.

regards, tom lane

#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#24)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Apr 14, 2022 at 1:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Not really following? ISTM the point of ffaa44cb5 was that if we don't
include the CatalogSnapshot in our advertised xmin, then the length
of time we can safely use it is *zero*.

No, I don't think so. I'm proposing that you shouldn't be taking a
catalog snapshot unless you already hold some other snapshot, and that
the catalog snapshot should always be newer than whatever that other
snapshot is. If we made that be true, then your catalog snapshot
couldn't ever be the thing holding back xmin -- and then I think it
doesn't need to be registered.

If you don't register it, then you need to also make sure that it's
destroyed whenever that older snapshot is. Which I think would require
either a lot of useless/inefficient CatalogSnapshot destructions, or
infrastructure that's more or less isomorphic to the RegisteredSnapshot
heap.

That is exactly my argument, but I'm not actually sure whether it is
in fact better. I was responding to Andres's statement that
CatalogSnapshot was hiding a lot of bugs because it makes it look like
we have a snapshot when we don't really.

Well, we DO have a snapshot, and it is 100% perfectly safe to use, if it's
registered. Andres' complaint is that that snapshot might get invalidated
when you weren't expecting it, but I'm not really convinced that we have
all that many bugs of that ilk. Wouldn't CLOBBER_CACHE_ALWAYS testing
find them?

regards, tom lane

#27Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#26)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Thu, Apr 14, 2022 at 3:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

If you don't register it, then you need to also make sure that it's
destroyed whenever that older snapshot is. Which I think would require
either a lot of useless/inefficient CatalogSnapshot destructions, or
infrastructure that's more or less isomorphic to the RegisteredSnapshot
heap.

Well, if that's true, then I agree that it's a good argument against
that approach. But I guess I'm confused as to why we'd end up in that
situation. Suppose we do these two things:

1. Decree that SnapshotResetXmin calls InvalidateCatalogSnapshot. It's
the other way around right now, but that's only because we're
registering the catalog snapshot.
2. Bomb out in GetCatalogSnapshot if you don't have an active or
registered snapshot already.

Is there some reason we'd need any more infrastructure than that?

Well, we DO have a snapshot, and it is 100% perfectly safe to use, if it's
registered. Andres' complaint is that that snapshot might get invalidated
when you weren't expecting it, but I'm not really convinced that we have
all that many bugs of that ilk. Wouldn't CLOBBER_CACHE_ALWAYS testing
find them?

Hmm, that's a good question. I don't know.

--
Robert Haas
EDB: http://www.enterprisedb.com

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#27)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Apr 14, 2022 at 3:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

If you don't register it, then you need to also make sure that it's
destroyed whenever that older snapshot is.

Well, if that's true, then I agree that it's a good argument against
that approach. But I guess I'm confused as to why we'd end up in that
situation. Suppose we do these two things:

1. Decree that SnapshotResetXmin calls InvalidateCatalogSnapshot. It's
the other way around right now, but that's only because we're
registering the catalog snapshot.
2. Bomb out in GetCatalogSnapshot if you don't have an active or
registered snapshot already.

Is there some reason we'd need any more infrastructure than that?

Yes.

1. Create snapshot 1 (beginning of transaction).
2. Create catalog snapshot (okay because of snapshot 1).
3. Create snapshot 2.
4. Destroy snapshot 1.
5. Catalog snapshot is still there and is now the oldest.

The implementation you propose would have to also forbid this sequence
of events, which is (a) difficult and (b) would add instability to the
system, since there's really no reason that this should be Not OK.

I'm basically not on board with adding complication to make the system
less performant and more brittle, and I don't see how the direction
you want to go isn't that.

(BTW, this thought experiment also puts a hole in the test added by
277692220: even if HaveRegisteredOrActiveSnapshot were doing what
it claims to do, it would allow use of the catalog snapshot for
detoasting after step 4, which I suppose is not what Andres intended.)

regards, tom lane

#29Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#26)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Hi,

On 2022-04-14 15:05:50 -0400, Tom Lane wrote:

Andres' complaint is that that snapshot might get invalidated when you
weren't expecting it, but I'm not really convinced that we have all
that many bugs of that ilk. Wouldn't CLOBBER_CACHE_ALWAYS testing
find them?

Don't see why it would - we don't have any mechanism in place for
enforcing that we don't update / delete a tuple we've looked up with an
xmin that wasn't continually enforced. A typical pattern is to use a
catalog cache (registered an all) for a syscache lookup, but then not
have a registered / active snapshot until an eventual update / delete
(after the syscache scan ends). Which isn't safe, because without a
MyProc->xmin set, the tuple we're updating / deleting could be updated,
removed and replaced with another tuple.

Greetings,

Andres Freund

#30Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#28)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Thu, Apr 14, 2022 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, if that's true, then I agree that it's a good argument against
that approach. But I guess I'm confused as to why we'd end up in that
situation. Suppose we do these two things:

1. Decree that SnapshotResetXmin calls InvalidateCatalogSnapshot. It's
the other way around right now, but that's only because we're
registering the catalog snapshot.
2. Bomb out in GetCatalogSnapshot if you don't have an active or
registered snapshot already.

Is there some reason we'd need any more infrastructure than that?

Yes.

1. Create snapshot 1 (beginning of transaction).
2. Create catalog snapshot (okay because of snapshot 1).
3. Create snapshot 2.
4. Destroy snapshot 1.
5. Catalog snapshot is still there and is now the oldest.

Sorry, I'm not seeing the problem. If we call SnapshotResetXmin()
after step 4, then the catalog snapshot would get invalidated under my
proposal. If we don't, then our advertised xmin has not changed and
nothing can be pruned out from under us.

I'm basically not on board with adding complication to make the system
less performant and more brittle, and I don't see how the direction
you want to go isn't that.

Well ... I agree that brittle is bad, but it's not clear to me which
way is actually less brittle. As for performant, I think you might be
misjudging the situation. My original design for removing SnapshotNow
didn't even have the catalog snapshot - it just took a new snapshot
every time. That was mostly fine, but Andres found a somewhat extreme
test case where it exhibited a significant regression, so I added the
catalog snapshot stuff to work around that. So I'm not AT ALL
convinced that giving catalog snapshots longer lifetimes is a relevant
thing to do. There's some value in it if you can construct a test case
where the overall rate of snapshot taking is extremely high, but in
normal cases that isn't so. It's certainly not worth complicating the
code for backend startup or DDL commands to reduce the number of
snapshots, because you're never going to have those things happening
at a high enough rate to matter, or so I believe.

The way you get a benefit from CatalogSnapshot is to construct a
workload with a lot of really cheap SQL statements each of which has
to do a bunch of catalog lookups, and then run that at high
concurrency. The concurrency makes taking snapshots more expensive,
because the cache lines are contended, and having the same command use
the same snapshot over and over again instead of taking new ones then
brings the cost down enough to be measurable. But people run SQL
statements in a tight loop, not DDL or backend startup.

--
Robert Haas
EDB: http://www.enterprisedb.com

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#30)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Robert Haas <robertmhaas@gmail.com> writes:

Well ... I agree that brittle is bad, but it's not clear to me which
way is actually less brittle. As for performant, I think you might be
misjudging the situation. My original design for removing SnapshotNow
didn't even have the catalog snapshot - it just took a new snapshot
every time. That was mostly fine, but Andres found a somewhat extreme
test case where it exhibited a significant regression, so I added the
catalog snapshot stuff to work around that. So I'm not AT ALL
convinced that giving catalog snapshots longer lifetimes is a relevant
thing to do.

Perhaps not. But right now we have to first think about correctness and
how much trouble it will be to get to correctness. ISTM you are arguing
for a design in which it's required that there is always a registered
or active snapshot that's older than the catalog snapshot (if any).
I tried revising snapmgr.c to enforce that, starting with adding

@@ -421,6 +421,13 @@ GetNonHistoricCatalogSnapshot(Oid relid)

     if (CatalogSnapshot == NULL)
     {
+        /*
+         * The catalog snapshot must always be newer than some active or
+         * registered snapshot.  (XXX explain why)
+         */
+        Assert(ActiveSnapshot != NULL ||
+               !pairingheap_is_empty(&RegisteredSnapshots));
+
         /* Get new snapshot. */
         CatalogSnapshot = GetSnapshotData(&CatalogSnapshotData);

and this blew up with truly impressive thoroughness. The autovac
launcher, logical replication launcher, and incoming backends all
fail this assertion instantly, making it impossible to find out
what else might be broken --- but I'm sure there is a lot.

(If you want to try this for yourself, remember that the postmaster
will relaunch the AV and LR launchers on failure, meaning that your
disk will fill with core files very very quickly. Just sayin'.)

So maybe we can go that direction, but it's going to require a lot of
code additions to push extra snapshots in places that haven't bothered
to date; and I'm not convinced that that'd be buying us anything
except more GetSnapshotData() calls.

Plan B is to grant catalog snapshots some more-durable status than
what Plan A envisions. I'm not very sure about the details, but if
you don't want to go that route then you'd better set about making
the above assertion work.

In the meantime, since it's clear that HaveRegisteredOrActiveSnapshot
is failing to meet its contract, I'm going to go fix it. I think
(based on the above argument) that what it intends to enforce is not
really the system design we need, but it certainly isn't helping
anyone that it enforces that design incorrectly.

regards, tom lane

#32Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#31)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Hi,

On 2022-04-16 14:42:39 -0400, Tom Lane wrote:

In the meantime, since it's clear that HaveRegisteredOrActiveSnapshot
is failing to meet its contract, I'm going to go fix it.

+1

I think (based on the above argument) that what it intends to enforce
is not really the system design we need, but it certainly isn't
helping anyone that it enforces that design incorrectly.

I think it's approximately right for the current caller. But that caller
likely needs an improved design around snapshots...

Greetings,

Andres Freund

#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#32)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Andres Freund <andres@anarazel.de> writes:

On 2022-04-16 14:42:39 -0400, Tom Lane wrote:

I think (based on the above argument) that what it intends to enforce
is not really the system design we need, but it certainly isn't
helping anyone that it enforces that design incorrectly.

I think it's approximately right for the current caller. But that caller
likely needs an improved design around snapshots...

Yeah, I think the real issue is that checking
HaveRegisteredOrActiveSnapshot in this way doesn't provide a very
good guarantee of what we really want to know, which is that the
session's advertised xmin is old enough to prevent removal of
whatever toast data we're trying to fetch. The fact that we
have a snapshot at the instant of fetch doesn't prove that it
existed continually since we fetched the toast reference, which
seems to be the condition we actually need to assure. (And TBH
I see little reason to think that whether the snapshot is the
CatalogSnapshot or not changes things in any meaningful way.)

I don't yet see a practical way to check for the real concern.
While it's something to worry about, there's no reason to think
that v15 is any worse than prior versions in this area, is there?
So I'm inclined to remove this from the list of v15 open items,
or at least demote the remaining concern to "older bug" status.

regards, tom lane

#34Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#31)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Sat, Apr 16, 2022 at 2:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

and this blew up with truly impressive thoroughness. The autovac
launcher, logical replication launcher, and incoming backends all
fail this assertion instantly, making it impossible to find out
what else might be broken --- but I'm sure there is a lot.

OK, thanks for trying that.

In the meantime, since it's clear that HaveRegisteredOrActiveSnapshot
is failing to meet its contract, I'm going to go fix it. I think
(based on the above argument) that what it intends to enforce is not
really the system design we need, but it certainly isn't helping
anyone that it enforces that design incorrectly.

+1.

--
Robert Haas
EDB: http://www.enterprisedb.com

#35Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#33)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Hi,

On 2022-04-17 11:51:58 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2022-04-16 14:42:39 -0400, Tom Lane wrote:

I think (based on the above argument) that what it intends to enforce
is not really the system design we need, but it certainly isn't
helping anyone that it enforces that design incorrectly.

I think it's approximately right for the current caller. But that caller
likely needs an improved design around snapshots...

Yeah, I think the real issue is that checking
HaveRegisteredOrActiveSnapshot in this way doesn't provide a very
good guarantee of what we really want to know, which is that the
session's advertised xmin is old enough to prevent removal of
whatever toast data we're trying to fetch.

Right. It's better than what was there before though - I added
HaveRegisteredOrActiveSnapshot() in the course of
7c38ef2a5d6cf6d8dc3834399d7a1c364d64ce64. Where the problem was that we
didn't have *any* snapshot other than the catalog snapshot, and the
catalog snapshot only sometimes (iirc for that bug it depended on the
order in which objects were deleted). That makes such bugs much harder
to detect.

The fact that we have a snapshot at the instant of fetch doesn't prove
that it existed continually since we fetched the toast reference,
which seems to be the condition we actually need to assure.

Right.

(And TBH I see little reason to think that whether the snapshot is the
CatalogSnapshot or not changes things in any meaningful way.)

It is a meaningful difference, see e.g. the bug referenced above.

I don't yet see a practical way to check for the real concern.
While it's something to worry about, there's no reason to think
that v15 is any worse than prior versions in this area, is there?
So I'm inclined to remove this from the list of v15 open items,
or at least demote the remaining concern to "older bug" status.

Yes.

Greetings,

Andres Freund

#36Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#35)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Mon, Apr 18, 2022 at 10:39 AM Andres Freund <andres@anarazel.de> wrote:

Right. It's better than what was there before though - I added
HaveRegisteredOrActiveSnapshot() in the course of
7c38ef2a5d6cf6d8dc3834399d7a1c364d64ce64. Where the problem was that we
didn't have *any* snapshot other than the catalog snapshot, and the
catalog snapshot only sometimes (iirc for that bug it depended on the
order in which objects were deleted). That makes such bugs much harder
to detect.

I still think it would be better to have GetOldestSnapshot() be
smarter and refuse to return the catalog snapshot. For one thing, that
way we'd be testing for the problem case in non-assert builds also.

--
Robert Haas
EDB: http://www.enterprisedb.com

#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#35)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Andres Freund <andres@anarazel.de> writes:

On 2022-04-17 11:51:58 -0400, Tom Lane wrote:

The fact that we have a snapshot at the instant of fetch doesn't prove
that it existed continually since we fetched the toast reference,
which seems to be the condition we actually need to assure.

Right.

(And TBH I see little reason to think that whether the snapshot is the
CatalogSnapshot or not changes things in any meaningful way.)

It is a meaningful difference, see e.g. the bug referenced above.

Well, that's true given the current arrangements for managing
CatalogSnapshot; but that doesn't make the CatalogSnapshot any
less of a protection when it exists. The direction I was vaguely
imagining is that we create some refcount-like infrastructure
directly ensuring that once a snapshot is used to read a toast
reference, it gets kept around until we dereference or discard
that reference. With a scheme like that, there'd be no reason to
discriminate against a CatalogSnapshot as being the protective
snapshot.

(I hasten to add that I have no idea how to make this half-baked
plan work, and there may be better solutions anyway.)

While it's something to worry about, there's no reason to think
that v15 is any worse than prior versions in this area, is there?
So I'm inclined to remove this from the list of v15 open items,
or at least demote the remaining concern to "older bug" status.

Yes.

OK, I'll update the open-items page.

regards, tom lane

#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#36)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Robert Haas <robertmhaas@gmail.com> writes:

I still think it would be better to have GetOldestSnapshot() be
smarter and refuse to return the catalog snapshot. For one thing, that
way we'd be testing for the problem case in non-assert builds also.

I was wondering about that too. On the other hand, given that
we know this area is squishy, transforming fails-in-assert-builds
to fails-everywhere is not necessarily desirable.

regards, tom lane

#39Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#38)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Mon, Apr 18, 2022 at 10:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I still think it would be better to have GetOldestSnapshot() be
smarter and refuse to return the catalog snapshot. For one thing, that
way we'd be testing for the problem case in non-assert builds also.

I was wondering about that too. On the other hand, given that
we know this area is squishy, transforming fails-in-assert-builds
to fails-everywhere is not necessarily desirable.

I agree that it's a little unclear. In general, I think if we're going
to blow up and die, doing it closer to the place where the problem is
happening is for the best. On the other hand, if in most practical
cases we're going to stumble through and get the right answer anyway,
then it's maybe not great to break a bunch of accidentally-working
cases. However, it does strikes me that this principal could easily be
overdone. init_toast_snapshot() could pick a random snapshot (or take
a new one) in order to call InitToastSnapshot() and that would often
work fine. Yet, upon realizing that things are busted, it chooses to
error out instead. I approve of that choice, and don't think we should
rule out the idea of making that check more robust.

--
Robert Haas
EDB: http://www.enterprisedb.com

#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#39)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Robert Haas <robertmhaas@gmail.com> writes:

I agree that it's a little unclear. In general, I think if we're going
to blow up and die, doing it closer to the place where the problem is
happening is for the best. On the other hand, if in most practical
cases we're going to stumble through and get the right answer anyway,
then it's maybe not great to break a bunch of accidentally-working
cases. However, it does strikes me that this principal could easily be
overdone. init_toast_snapshot() could pick a random snapshot (or take
a new one) in order to call InitToastSnapshot() and that would often
work fine. Yet, upon realizing that things are busted, it chooses to
error out instead. I approve of that choice, and don't think we should
rule out the idea of making that check more robust.

I'm all for improving robustness, but failing in cases that would have
worked before (even if only accidentally) is not going to be seen by
users as more robust. I think that this late stage of the development
cycle is not the time to be putting in changes that are not actually
going to fix bugs but only call greater attention to the possibility
that a bug exists.

TBH, given where we are in the dev cycle, I thought there was a lot of
sense behind your earlier thought that HaveRegisteredOrActiveSnapshot
should be reverted entirely. I'm okay with keeping it as an assertion-
only check, so that it won't bother end users. I'm not okay with
adding end-user-visible failures, at least not till early in v16.

regards, tom lane

#41Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#40)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Mon, Apr 18, 2022 at 11:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I agree that it's a little unclear. In general, I think if we're going
to blow up and die, doing it closer to the place where the problem is
happening is for the best. On the other hand, if in most practical
cases we're going to stumble through and get the right answer anyway,
then it's maybe not great to break a bunch of accidentally-working
cases. However, it does strikes me that this principal could easily be
overdone. init_toast_snapshot() could pick a random snapshot (or take
a new one) in order to call InitToastSnapshot() and that would often
work fine. Yet, upon realizing that things are busted, it chooses to
error out instead. I approve of that choice, and don't think we should
rule out the idea of making that check more robust.

I'm all for improving robustness, but failing in cases that would have
worked before (even if only accidentally) is not going to be seen by
users as more robust. I think that this late stage of the development
cycle is not the time to be putting in changes that are not actually
going to fix bugs but only call greater attention to the possibility
that a bug exists.

TBH, given where we are in the dev cycle, I thought there was a lot of
sense behind your earlier thought that HaveRegisteredOrActiveSnapshot
should be reverted entirely. I'm okay with keeping it as an assertion-
only check, so that it won't bother end users. I'm not okay with
adding end-user-visible failures, at least not till early in v16.

I wasn't really taking a position either way about timing. If we can
demonstrate that things other than HaveRegisteredOrActiveSnapshot()
itself are misbehaving, then I think fixes for those bugs are
potentially back-patchable no matter where we are in the release
cycle, but in terms of when we make changes to try to detect bugs we
don't know about yet, I could go either way on whether to do that now
or wait. We can't know whether the bugs we haven't found yet will
cause a big problem for someone tomorrow, ten years from now, or
never.

I am not really very happy about HaveRegisteredOrActiveSnapshot(),
honestly. I think in the form we have it in the tree it looks
under-engineered. It's not really testing the right thing (even
leaving aside the bug fix) as you have been pointing out, it doesn't
really mesh well with the sanity checking that was there before as I
have been pointing out, and it's only used in one place. I wouldn't be
sad if it got reverted. However, I don't think it's going to do us any
great harm, either. Although it's a long way from the best thing we
have in the tree, it's also a long way from the worst thing we have in
the tree.

--
Robert Haas
EDB: http://www.enterprisedb.com

#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#41)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Robert Haas <robertmhaas@gmail.com> writes:

I wasn't really taking a position either way about timing. If we can
demonstrate that things other than HaveRegisteredOrActiveSnapshot()
itself are misbehaving, then I think fixes for those bugs are
potentially back-patchable no matter where we are in the release
cycle,

Sure, but ...

but in terms of when we make changes to try to detect bugs we
don't know about yet, I could go either way on whether to do that now
or wait. We can't know whether the bugs we haven't found yet will
cause a big problem for someone tomorrow, ten years from now, or
never.

... I think in this case we do have a pretty good idea of the possible
consequences. Most of the time, an unsafe toast fetch will work
fine because the toast data is still there. If you're very unlucky
then it's been deleted, and vacuumed away, and then you get a "missing
chunk number" error. If you're really astronomically unlucky, perhaps
the toast OID has been recycled and you get the wrong data (it's not
clear to me whether the toast snapshot visibility rules would prevent
this). I doubt we need to factor that last scenario into practical risk
estimates, though. So adding a non-assert check for snapshot misuse
would effectively convert "if you're very unlucky you get a weird error"
to "lucky or not, you get some other weird error", which no user is
going to see as an improvement.

I am not really very happy about HaveRegisteredOrActiveSnapshot(),
honestly.

Me either. If we find any other cases where it gives a false positive,
I'll be for removing it rather than fixing it. But for the moment
I'm content to leave it until we have a well-engineered solution to
the real problem.

regards, tom lane

#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#35)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Andres Freund <andres@anarazel.de> writes:

On 2022-04-17 11:51:58 -0400, Tom Lane wrote:

The fact that we have a snapshot at the instant of fetch doesn't prove
that it existed continually since we fetched the toast reference,
which seems to be the condition we actually need to assure.

Right.

BTW, after thinking about this for a bit I am less concerned than
I was about the system being full of bugs of this ilk. The executor
per se should be fine because it does everything under a live snapshot.
We had bugs with cases that shove executor output into long-lived
tuplestores, but we've dealt with that scenario. Catalog updates
performed on tuples fetched from a catalog scan seem safe enough too.
Andres was worried about catalog updates performed using tuples fetched
from catcache, but that's not a problem because we detoasted every value
when it went into the catcache, cf 08e261cbc.

(Mind you, 08e261cbc's solution is risky performancewise, because it
means we have to re-toast every value during such catalog updates,
instead of being able to carry the old values of unchanged columns
forward. But it's not a correctness bug.)

(Also, the whining I did in 08e261cbc's commit message is no longer
relevant now that we read catalogs with MVCC snapshots.)

There may be some corner cases that aren't described by any of these
three blanket scenarios, but they've got to be pretty few and far
between.

regards, tom lane

#44Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#43)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Mon, Apr 18, 2022 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

There may be some corner cases that aren't described by any of these
three blanket scenarios, but they've got to be pretty few and far
between.

My first thought whenever anything like this comes up is cursors,
especially but not only holdable cursors. Also, plpgsql variables,
maybe mixed with embedded COMMIT/ROLLBACK. I don't find it
particularly hard to believe we have some bugs in
insufficiently-well-considered parts of the system that pass around
datums outside of the normal executor flow, but I don't know exactly
how to find them all, either.

--
Robert Haas
EDB: http://www.enterprisedb.com

#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#44)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Apr 18, 2022 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

There may be some corner cases that aren't described by any of these
three blanket scenarios, but they've got to be pretty few and far
between.

My first thought whenever anything like this comes up is cursors,
especially but not only holdable cursors. Also, plpgsql variables,
maybe mixed with embedded COMMIT/ROLLBACK.

Those exact cases have had detoasting bugs in the past and are now fixed.

I don't find it
particularly hard to believe we have some bugs in
insufficiently-well-considered parts of the system that pass around
datums outside of the normal executor flow, but I don't know exactly
how to find them all, either.

I'm not here to claim that there are precisely zero remaining bugs
of this ilk. I'm just saying that I think we've flushed out most
of them. I think there is some value in trying to think of a way
to prove that none remain, but it's not a problem we can solve
for v15.

regards, tom lane

#46Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#45)
Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)

On Tue, Apr 19, 2022 at 10:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm not here to claim that there are precisely zero remaining bugs
of this ilk. I'm just saying that I think we've flushed out most
of them. I think there is some value in trying to think of a way
to prove that none remain, but it's not a problem we can solve
for v15.

Sure, that's fine.

--
Robert Haas
EDB: http://www.enterprisedb.com