Bugs in TOAST handling, OID assignment and redo recovery
Hello,
One of our 2ndQuadrant support customers recently reported a sudden rush of
TOAST errors after a crash recovery, nearly causing an outage. Most errors
read like this:
ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn
While we could quickly bring the cluster back to normal using some
workarounds, I investigated this in more detail and identified two
long-standing bugs, one in TOAST handling and one in redo recovery.
Immediately after the crash recovery, the server started issuing OIDs that
conflicted with the OIDs issued just prior to the crash. The XLOG_NEXTOID
and buffer LSN interlock is supposed to guard against that, but a bug in
the redo recovery breaks that protection. We shall see that in a minute.
When duplicate OIDs are issued, either because of a bug in redo recovery or
because of OID counter wraparound, GetNewOidWithIndex() is supposed to
protect us against that by first scanning the toast table to check if a
duplicate chunk_id or toast value already exists. If so, a new OID is
requested and we repeat until a non-conflicting OID is found.
This mostly works, except in one corner case. GetNewOidWithIndex() uses
SnapshotDirty to look for a conflicting OID. But this fails if the existing
toast tuple was recently deleted by another, committed transaction. If the
OldestXmin has not advanced enough, such tuples are reported as
RECENTLY_DEAD and fail the SnapshotDirty qualifications.
So toast_save_datum() happily goes ahead and inserts a conflicting toast
tuple. The UNIQUE index on <chunk_id, chunk_seq> does not complain because
it sees the other tuple as RECENTLY_DEAD too. At this point, we have two
toast tuples with the same <chunk_id, chunk_seq> - one of these is
RECENTLY_DEAD and the other is LIVE. Later, when toast_fetch_datum() and
friends scan the toast table for the chunk_id (toast value), they use
SnapshotToast for retrieving the tuple. This special snapshot doesn't do
any visibility checks on the toast tuple, assuming that since the main heap
tuple is visible, the toast tuple(s) should be visible too. This snapshot
unfortunately also sees the RECENTLY_DEAD toast tuple and thus returns
duplicate toast tuples, leading to errors such as the one above. The same
issue can also lead to other kinds of errors in tuptoaster.c.
ISTM that the right fix is to teach toast_save_datum() to check for
existence of duplicate chunk_id by scanning the table with the same
SnapshotToast that it later uses to fetch the tuples. We already do that in
case of toast rewrite, but not for regular inserts. I propose to do that
for regular path too, as done in the attached patch.
Note that duplicate OIDs can also be generated by plain OID counter
wraparound, which can cause this problem on its own:
- OID counter wraps around
- Old toast tuple is deleted and the deleting transaction commits
- OldestXmin is held back by an open transaction
- The same OID is then inserted again into the toast table
- The toasted value is fetched before the table is vacuumed or the index
pointer is killed. This leads to duplicates being returned.
This is probably quite a corner case, but not impossible, given that we
already have reports of such problems. But the problem gets accentuated in
the case of crash recovery or on a hot standby. This is because of another
somewhat unrelated bug which manifested in our case.
An interesting case to consider is the following sequence of events:
1. ONLINE checkpoint starts. We note down the nextOid counter, after
rightly accounting for the cached values too. CreateCheckPoint() does this:
LWLockAcquire(OidGenLock, LW_SHARED);
checkPoint.nextOid = ShmemVariableCache->nextOid;
if (!shutdown)
checkPoint.nextOid += ShmemVariableCache->oidCount;
LWLockRelease(OidGenLock);
2. The cached OIDs are then consumed, and a new set of OIDs are allocated,
generating XLOG_NEXTOID WAL record.
3. More OIDs are consumed from the newly allocated range. Those OIDs make
it to disk, following the buffer LSN interlock, so XLOG_NEXTOID gets
flushed to disk as well.
4. The checkpoint finishes and writes an ONLINE checkpoint record with the
nextOid value saved in step 1.
5. CRASH
The redo recovery starts at some LSN written before step 1. It initialises
the ShmemVariableCache->nextOid with the value stored in the checkpoint
record and starts replaying WAL records. When it sees the XLOG_NEXTOID WAL
record generated at step 2, it correctly advances the OID counter to cover
the next range.
But while applying ONLINE checkpoint written at step 4, we do this:
/* ... but still treat OID counter as exact */
LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
ShmemVariableCache->nextOid = checkPoint.nextOid;
ShmemVariableCache->oidCount = 0;
LWLockRelease(OidGenLock);
So effectively we overwrite the nextOid counter and drag it backwards
again. If we don't see another XLOG_NEXTOID record before the crash
recovery ends, then we shall start the server with a stale value,
generating duplicate OIDs immediately after the restart.
I don't think this is an unusual case and I'm surprised that we didn't
notice it before. Most likely the other protections in the system that
guard against wrapped-around OIDs masked the bug. The TOAST table, though,
escapes those protections in the specific case discussed above, which
brought the bug out into the open.
I think the right fix is to simply ignore the nextOid counter while
replaying an ONLINE checkpoint record. We must have already initialised
ShmemVariableCache->nextOid to the value stored in this (or some previous)
checkpoint record when redo recovery started. As and when we see an
XLOG_NEXTOID record, we maintain the shared memory counter correctly. If we
don't see any XLOG_NEXTOID record, the value we started with must be the
correct one. I
see no problem even when OID wraps around during redo recovery.
XLOG_NEXTOID should record that correctly.
Curiously, neither REINDEX nor VACUUM will fix the problem until the
OldestXmin counter advances enough that the RECENTLY_DEAD tuples are
ignored by REINDEX or removed by VACUUM. Once OldestXmin has progressed, a
REINDEX or a VACUUM should fix these transient errors; until then, once
manifested, the errors persist.
We tested this at least as far back as 9.1 and the bugs exist there too.
In fact, these bugs have probably existed forever, though I did not check
very old releases.
Attached is a simple reproducer and a proposed fix to address both bugs.
We should consider backpatching it to all supported releases.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment: pg_toast_corruption_v3.patch
diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 546f80f05c..01c392d546 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -1569,11 +1569,22 @@ toast_save_datum(Relation rel, Datum value,
*/
if (!OidIsValid(rel->rd_toastoid))
{
- /* normal case: just choose an unused OID */
- toast_pointer.va_valueid =
- GetNewOidWithIndex(toastrel,
- RelationGetRelid(toastidxs[validIndex]),
- (AttrNumber) 1);
+ /*
+ * Normal case: just choose an unused OID. But we must scan the TOAST
+ * table using SnapshotToast to ensure that the value does not exist.
+ * Note that GetNewOidWithIndex() scans the table with SnapshotDirty
+ * and that may not see a RECENTLY_DEAD tuple. But such tuples are
+ * visible to SnapshotToast and hence we must refrain from using such
+ * OIDs. Otherwise toast_fetch_datum() or friends might see duplicate
+ * OIDs, leading to all kinds of errors.
+ */
+ do
+ {
+ toast_pointer.va_valueid =
+ GetNewOidWithIndex(toastrel,
+ RelationGetRelid(toastidxs[validIndex]),
+ (AttrNumber) 1);
+ } while (toastrel_valueid_exists(toastrel, toast_pointer.va_valueid));
}
else
{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b4fd8395b7..ba8b969cc2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9785,11 +9785,21 @@ xlog_redo(XLogReaderState *record)
checkPoint.nextXid))
ShmemVariableCache->nextXid = checkPoint.nextXid;
LWLockRelease(XidGenLock);
- /* ... but still treat OID counter as exact */
- LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
- ShmemVariableCache->nextOid = checkPoint.nextOid;
- ShmemVariableCache->oidCount = 0;
- LWLockRelease(OidGenLock);
+
+ /*
+ * In an ONLINE checkpoint, don't update the nextOid counter. If there
+ * was no XLOG_NEXTOID record seen since the starting checkpoint, then
+ * we continue to believe that the current value is correct. If we saw
+ * a XLOG_NEXTOID record, then we should rather use that and not
+ * overwrite with a potentially old value, thus issuing duplicate
+ * OIDs when we are fully up.
+ *
+ * If you're still curious why that can happen, note that the online
+ * checkpoint may have started before our original redo position and
+ * hence it may have a value later made stale by an explicit
+ * XLOG_NEXTOID.
+ */
+
MultiXactAdvanceNextMXact(checkPoint.nextMulti,
checkPoint.nextMultiOffset);
On Tuesday, 10 Apr 2018 at 15:59 +0530, Pavan Deolasee wrote:
One of our 2ndQuadrant support customers recently reported a sudden rush
of TOAST errors after a crash recovery, nearly causing an outage. Most
errors read like this:
ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn
While we could bring back the cluster to normal quickly using some
workarounds, I investigated this in more detail and identified two
long-standing bugs in TOAST as well as redo recovery.
Wow! I'm currently investigating issues on a customer system too, where
TOAST errors suddenly arose after a shutdown immediate.
I haven't dug into your findings yet, but they seem to fit perfectly with
the reports I got.
Thanks,
Bernd
On 10/04/18 13:29, Pavan Deolasee wrote:
Hello,
One of our 2ndQuadrant support customers recently reported a sudden rush of
TOAST errors post a crash recovery, nearly causing an outage. Most errors
read like this:
ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn
While we could bring back the cluster to normal quickly using some
workarounds, I investigated this in more detail and identified two long
standing bugs in TOAST as well as redo recovery.
Great detective work!
ISTM that the right fix is to teach toast_save_datum() to check for
existence of duplicate chunk_id by scanning the table with the same
SnapshotToast that it later uses to fetch the tuples. We already do that in
case of toast rewrite, but not for regular inserts. I propose to do that
for regular path too, as done in the attached patch.
It would seem more straightforward to add a snapshot parameter to
GetNewOidWithIndex(), so that this one caller could pass
SnapshotToast, while others pass SnapshotDirty. With your patch, you
check the index twice: first with SnapshotDirty, in
GetNewOidWithIndex(), and then with SnapshotToast, in the caller.
If I'm reading the rewrite case correctly, it's a bit different and
quite special. In the loop with GetNewOidWithIndex(), it needs to check
that the OID is unused in two tables, the old TOAST table, and the new
one. You can only pass one index to GetNewOidWithIndex(), so it needs to
check the second index manually. It's not because of the snapshot issue.
Although I wonder if we should be using SnapshotToast in that
GetNewOidWithIndex() call, too. I.e. if we should be checking both the
old and the new toast table with SnapshotToast.
I think the right fix is to simply ignore the nextOid counter while
replaying ONLINE checkpoint record. We must have already initialised
ShmemVariableCache->nextOid
to the value stored in this (or some previous) checkpoint record when redo
recovery is started. As and when we see XLOG_NEXTOID record, we would
maintain the shared memory counter correctly. If we don't see any
XLOG_NEXTOID record, the value we started with must be the correct value. I
see no problem even when OID wraps around during redo recovery.
XLOG_NEXTOID should record that correctly.
Agreed. With nextXid, we advance ShmemVariableCache->nextXid if the
value in the online checkpoint record is greater than
ShmemVariableCache->nextXid. But we don't have such a wraparound-aware
concept of "greater than" for OIDs. Ignoring the online checkpoint
record's nextOid value seems fine to me.
- Heikki
Hi Heikki,
On Tue, Apr 10, 2018 at 7:07 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
It would seem more straightforward to add a snapshot parameter to
GetNewOidWithIndex(), so that the this one caller could pass SnapshotToast,
while others pass SnapshotDirty. With your patch, you check the index
twice: first with SnapshotDirty, in GetNewOidWithIndex(), and then with
SnapshotToast, in the caller.
Hmm. I actually wrote my first patch exactly like that. I am trying to
remember why I discarded that approach. Was that to do with the fact
that SnapshotToast
can't see all toast tuples either? Yeah, I think so. For example, it won't
see tuples with an uncommitted xmin, leading to different issues. Now it's
unlikely that we will have an OID conflict where the old tuple has an
uncommitted xmin, but I'm not sure we can completely rule that out. For
example, had we not uncovered the redo recovery bug, we could have had a
prepared transaction that inserted the old tuple, which would now conflict
with the new tuple and not be detected by SnapshotToast.
If I'm reading the rewrite case correctly, it's a bit different and quite
special. In the loop with GetNewOidWithIndex(), it needs to check that the
OID is unused in two tables, the old TOAST table, and the new one. You can
only pass one index to GetNewOidWithIndex(), so it needs to check the
second index manually. It's not because of the snapshot issue. Although I
wonder if we should be using SnapshotToast in that GetNewOidWithIndex()
call, too. I.e. if we should be checking both the old and the new toast
table with SnapshotToast.
As I said, I am not sure if checking with just SnapshotToast is enough
because it can't see "dirty" tuples.
Agreed. With nextXid, we advance ShmemVariableCache->nextXid if the value
in the online checkpoint record is greater than
ShmemVariableCache->nextXid. But we don't have such a wraparound-aware
concept of "greater than" for OIDs. Ignoring the online checkpoint record's
nextOid value seems fine to me.
Ok. Thanks for checking.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Apr 10, 2018 at 7:24 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
Hi Heikki,
On Tue, Apr 10, 2018 at 7:07 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:
It would seem more straightforward to add a snapshot parameter to
GetNewOidWithIndex(), so that the this one caller could pass SnapshotToast,
while others pass SnapshotDirty. With your patch, you check the index
twice: first with SnapshotDirty, in GetNewOidWithIndex(), and then with
SnapshotToast, in the caller.
Hmm. I actually wrote my first patch exactly like that. I am trying to
remember why I discarded that approach. Was that to do with the fact that SnapshotToast
can't see all toast tuples either? Yeah, I think so. For example, it won't
see tuples with an uncommitted xmin, leading to different issues. Now it's
unlikely that we will have an OID conflict where the old tuple has an
uncommitted xmin, but I'm not sure we can completely rule that out.
Or maybe we simply err on the side of caution and scan the toast table
with SnapshotAny while looking for a duplicate? That might prevent us from
reusing an OID for a known-dead tuple, but it should save us a second
index scan and still work.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
Or maybe we simply err on the side of caution and scan the toast table
with SnapshotAny while looking for a duplicate? That might prevent us from
reusing an OID for a known-dead tuple, but it should save us a second
index scan and still work.
+1. We really don't want to expend two indexscans on this.
I was worried about changing the signature of GetNewOidWithIndex in
a back-patched fix, but after looking around I think that's probably
safe. External callers really shouldn't be using anything lower-level
than GetNewOid.
regards, tom lane
On Wed, Apr 11, 2018 at 8:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
Or maybe we simply err on the side of caution and scan the toast table
with SnapshotAny while looking for a duplicate? That might prevent us from
reusing an OID for a known-dead tuple, but it should save us a second
index scan and still work.
+1. We really don't want to expend two indexscans on this.
Ok. I propose the attached patches, more polished this time. I also
changed toastrel_valueid_exists() to use SnapshotAny, but I don't have any
proof that that's a needed change. It seems harmless, though, and given
the number of TOAST-related, hard-to-chase bugs we have seen over the
years, maybe it's worth making this foolproof too.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment: 0002-Do-more-extensive-search-while-looking-for-duplicate.patch
From 20ecc8bc74acc631c8714d2cec08008d86ace076 Mon Sep 17 00:00:00 2001
From: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Wed, 11 Apr 2018 23:49:48 +0530
Subject: [PATCH 2/2] Do more extensive search while looking for duplicate OID
in a toast table
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
We now use SnapshotAny while searching the toast index/table for a potential
duplicate OID. This is necessary because SnapshotDirty, which the code
previously used, fails to see RECENTLY_DEAD tuples and thus risked inserting
duplicate OIDs that are not visible to an MVCC snapshot but are visible to
SnapshotToast, which is used to scan toast tables.
Per multiple reports over the last many years, the most recent being from Adam
Sjøgren and one of 2ndQuadrant's customers.
This should be back patched to all supported releases.
---
src/backend/access/heap/tuptoaster.c | 28 +++++++++++++++++++++-------
src/backend/catalog/catalog.c | 17 +++++++++--------
src/include/catalog/catalog.h | 3 ++-
3 files changed, 32 insertions(+), 16 deletions(-)
diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 546f80f05c..3d00f6873a 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -1569,11 +1569,25 @@ toast_save_datum(Relation rel, Datum value,
*/
if (!OidIsValid(rel->rd_toastoid))
{
- /* normal case: just choose an unused OID */
+ /*
+ * Normal case: just choose an unused OID. But we must scan the TOAST
+ * table using SnapshotAny to ensure that the value does not exist. We
+ * must take into account RECENTLY_DEAD tuples since SnapshotToast can
+ * see those tuples and we must take into account uncommitted tuples
+ * since they may become LIVE in due time. So SnapshotAny is the right
+ * thing to use.
+ *
+ * NB: There was a long-standing bug where we scanned the toast
+ * table with SnapshotDirty, thus failing to note conflicting OIDs
+ * in RECENTLY_DEAD tuples. toast_fetch_datum() and friends then would
+ * return duplicate tuples, leading to various errors while reading
+ * from the toast tuple.
+ */
toast_pointer.va_valueid =
GetNewOidWithIndex(toastrel,
RelationGetRelid(toastidxs[validIndex]),
- (AttrNumber) 1);
+ (AttrNumber) 1,
+ SnapshotAny);
}
else
{
@@ -1627,7 +1641,8 @@ toast_save_datum(Relation rel, Datum value,
toast_pointer.va_valueid =
GetNewOidWithIndex(toastrel,
RelationGetRelid(toastidxs[validIndex]),
- (AttrNumber) 1);
+ (AttrNumber) 1,
+ SnapshotAny);
} while (toastid_valueid_exists(rel->rd_toastoid,
toast_pointer.va_valueid));
}
@@ -1806,7 +1821,6 @@ toastrel_valueid_exists(Relation toastrel, Oid valueid)
int num_indexes;
int validIndex;
Relation *toastidxs;
- SnapshotData SnapshotToast;
/* Fetch a valid index relation */
validIndex = toast_open_indexes(toastrel,
@@ -1823,12 +1837,12 @@ toastrel_valueid_exists(Relation toastrel, Oid valueid)
ObjectIdGetDatum(valueid));
/*
- * Is there any such chunk?
+ * Is there any such chunk? Scan using SnapshotAny to ensure we do not miss
+ * any tuple, LIVE or RECENTLY_DEAD.
*/
- init_toast_snapshot(&SnapshotToast);
toastscan = systable_beginscan(toastrel,
RelationGetRelid(toastidxs[validIndex]),
- true, &SnapshotToast, 1, &toastkey);
+ true, SnapshotAny, 1, &toastkey);
if (systable_getnext(toastscan) != NULL)
result = true;
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 809749add9..afff78e2f7 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -288,7 +288,8 @@ IsSharedRelation(Oid relationId)
Oid
GetNewOid(Relation relation)
{
- Oid oidIndex;
+ Oid oidIndex;
+ SnapshotData SnapshotDirty;
/* If relation doesn't have OIDs at all, caller is confused */
Assert(relation->rd_rel->relhasoids);
@@ -315,8 +316,11 @@ GetNewOid(Relation relation)
return GetNewObjectId();
}
+ InitDirtySnapshot(SnapshotDirty);
+
/* Otherwise, use the index to find a nonconflicting OID */
- return GetNewOidWithIndex(relation, oidIndex, ObjectIdAttributeNumber);
+ return GetNewOidWithIndex(relation, oidIndex, ObjectIdAttributeNumber,
+ &SnapshotDirty);
}
/*
@@ -333,10 +337,10 @@ GetNewOid(Relation relation)
* Caller must have a suitable lock on the relation.
*/
Oid
-GetNewOidWithIndex(Relation relation, Oid indexId, AttrNumber oidcolumn)
+GetNewOidWithIndex(Relation relation, Oid indexId, AttrNumber oidcolumn,
+ Snapshot snapshot)
{
Oid newOid;
- SnapshotData SnapshotDirty;
SysScanDesc scan;
ScanKeyData key;
bool collides;
@@ -349,8 +353,6 @@ GetNewOidWithIndex(Relation relation, Oid indexId, AttrNumber oidcolumn)
*/
Assert(!IsBinaryUpgrade || RelationGetRelid(relation) != TypeRelationId);
- InitDirtySnapshot(SnapshotDirty);
-
/* Generate new OIDs until we find one not in the table */
do
{
@@ -364,8 +366,7 @@ GetNewOidWithIndex(Relation relation, Oid indexId, AttrNumber oidcolumn)
ObjectIdGetDatum(newOid));
/* see notes above about using SnapshotDirty */
- scan = systable_beginscan(relation, indexId, true,
- &SnapshotDirty, 1, &key);
+ scan = systable_beginscan(relation, indexId, true, snapshot, 1, &key);
collides = HeapTupleIsValid(systable_getnext(scan));
diff --git a/src/include/catalog/catalog.h b/src/include/catalog/catalog.h
index 197e77f7f4..b5338646f2 100644
--- a/src/include/catalog/catalog.h
+++ b/src/include/catalog/catalog.h
@@ -16,6 +16,7 @@
#include "catalog/pg_class.h"
#include "utils/relcache.h"
+#include "utils/snapshot.h"
extern bool IsSystemRelation(Relation relation);
@@ -35,7 +36,7 @@ extern bool IsSharedRelation(Oid relationId);
extern Oid GetNewOid(Relation relation);
extern Oid GetNewOidWithIndex(Relation relation, Oid indexId,
- AttrNumber oidcolumn);
+ AttrNumber oidcolumn, Snapshot snapshot);
extern Oid GetNewRelFileNode(Oid reltablespace, Relation pg_class,
char relpersistence);
--
2.14.3 (Apple Git-98)
Attachment: 0001-Do-not-overwrite-the-nextOid-counter-while-replaying.patch
From bb906c85b1ac96e3ae2b9cda3edd50d6d979b442 Mon Sep 17 00:00:00 2001
From: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Wed, 11 Apr 2018 22:56:03 +0530
Subject: [PATCH 1/2] Do not overwrite the nextOid counter while replaying
ONLINE checkpoint
This fixes a long-standing bug in the replay of the ONLINE checkpoint record
which may lead to duplicate OID assignment after crash recovery or standby
promotion. For example, we may have a situation where an ONLINE checkpoint
begins, we then consume all cached OIDs, allocate a new set of OIDs and
generate an XLOG_NEXTOID record, and consume more OIDs. The ONLINE checkpoint
then ends and we write a checkpoint record. This checkpoint record now has an
OID counter less than the OIDs already written to disk.
For that reason, during REDO recovery after a crash, or while replaying the
ONLINE checkpoint on a standby, we shouldn't overwrite the current nextOid
counter with what's stored in the checkpoint record.
This bug prominently brought out another bug in TOAST handling, and the fix
should be backpatched to all supported releases.
---
src/backend/access/transam/xlog.c | 28 +++++++++++++++++++++++-----
1 file changed, 23 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a51df6b0b9..ab1e209cd3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9785,11 +9785,29 @@ xlog_redo(XLogReaderState *record)
checkPoint.nextXid))
ShmemVariableCache->nextXid = checkPoint.nextXid;
LWLockRelease(XidGenLock);
- /* ... but still treat OID counter as exact */
- LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
- ShmemVariableCache->nextOid = checkPoint.nextOid;
- ShmemVariableCache->oidCount = 0;
- LWLockRelease(OidGenLock);
+
+ /*
+ * In an ONLINE checkpoint, don't update the nextOid counter. When the
+ * REDO recovery began, we must have initialized the nextOid counter to the
+ * value stored in the REDO checkpoint record. If there was no
+ * XLOG_NEXTOID record replayed since the REDO checkpoint, then we know
+ * that the current value is correct. If we replayed a XLOG_NEXTOID
+ * record, then we should rather use the value advanced by the
+ * XLOG_NEXTOID and not overwrite with a potentially old value, thus
+ * issuing duplicate OIDs when we are fully up.
+ *
+ * If you're still curious why that can happen, note that the ONLINE
+ * checkpoint may have started before we consumed all the cached OIDs
+ * and requested another block of OIDs, generating a XLOG_NEXTOID
+ * record. So the value noted in the checkpoint record would be stale.
+ *
+ * Note that since we don't have a concept of wrap-around detection for
+ * OIDs, like we have for XIDs, we can't merely check if the value
+ * stored in the ONLINE checkpoint record is greater than the current
+ * value and apply the change only in that case like we do for nextXid.
+ */
+
+ /* Handle multixact */
MultiXactAdvanceNextMXact(checkPoint.nextMulti,
checkPoint.nextMultiOffset);
--
2.14.3 (Apple Git-98)
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
Ok. I propose attached patches, more polished this time.
I'll take these, unless some other committer is hot to do so?
regards, tom lane
On 11 April 2018 at 19:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
Ok. I propose attached patches, more polished this time.
I'll take these, unless some other committer is hot to do so?
Please go ahead.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
So while looking at this, it suddenly occurred to me that probing with
SnapshotDirty isn't that safe for regular (non-TOAST) Oid assignment
either. SnapshotDirty will consider a row dead the instant the
deleting transaction has committed, but it may remain visible to other
transactions for awhile after that --- and now that we use MVCC for
catalog scans, that applies to them too. Hence, the existing logic
is capable of creating objects with transiently-conflicting OIDs.
I don't think this could create a conflict that's visible outside
our own transaction, since anyone who can see our commit would also
see the commit of the deleting transaction. But there's definitely
a hazard inside the transaction that creates a new object.
I propose therefore that the right fix does not require an API change
for GetNewOidWithIndex: we can just make it use SnapshotAny all the
time.
regards, tom lane
On Wed, Apr 11, 2018 at 04:28:33PM -0400, Tom Lane wrote:
I propose therefore that the right fix does not require an API change
for GetNewOidWithIndex: we can just make it use SnapshotAny all the
time.
I have not really checked this thread in detail, but one thing that
strikes me is that it would be rather easy to add a TAP test based on
the initial script that Pavan has sent. Would that be worth testing
cycles or not? We do have tests like 008_fsm_truncation.pl which test
similar failure scenarios, so I think that we should have one.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
I have not really checked this thread in detail, but one thing that
strikes me is that it would be rather easy to add a TAP test based on
the initial script that Pavan has sent. Would that be worth testing
cycles or not?
I doubt it --- that test is so specialized that it'd be unlikely to
catch any other bug.
An example of the sort of bug that could exist in the same area would
be if we'd failed to note the places in tuptoaster.c that also need to
be switched to SnapshotAny. Only a test covering heap-rewrite could
have found that, which this test case surely doesn't.
What I think would actually be worth spending some time on is taking
a hard look at the remaining uses of SnapshotDirty to see if any of
them look squishy.
regards, tom lane
On 4/10/18 06:29, Pavan Deolasee wrote:
One of our 2ndQuadrant support customers recently reported a sudden rush
of TOAST errors post a crash recovery, nearly causing an outage. Most
errors read like this:
ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn
While researching this, I found that the terminology in this code is
quite inconsistent. It talks about chunk ids, chunk indexes, chunk
numbers, etc. seemingly interchangeably. The above error is actually
about the chunk_seq, not about the chunk_id, as one might think.
The attached patch is my attempt to clean this up a bit. Thoughts?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment: 0001-Clarify-TOAST-terminology-in-variables-and-error-mes.patch
From db7f5eacc23114f24230dd087837d94bcefc4c88 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Wed, 11 Apr 2018 20:11:35 -0400
Subject: [PATCH] Clarify TOAST terminology in variables and error messages
---
src/backend/access/heap/tuptoaster.c | 67 ++++++++++++++--------------
1 file changed, 34 insertions(+), 33 deletions(-)
diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index cd42c50b09..4113cb29d2 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -1882,10 +1882,10 @@ toast_fetch_datum(struct varlena *attr)
struct varlena *result;
struct varatt_external toast_pointer;
int32 ressize;
- int32 residx,
- nextidx;
+ int32 res_seq,
+ next_seq;
int32 numchunks;
- Pointer chunk;
+ Pointer chunk_data;
bool isnull;
char *chunkdata;
int32 chunksize;
@@ -1930,13 +1930,13 @@ toast_fetch_datum(struct varlena *attr)
ObjectIdGetDatum(toast_pointer.va_valueid));
/*
- * Read the chunks by index
+ * Read the chunks by sequence number
*
- * Note that because the index is actually on (valueid, chunkidx) we will
- * see the chunks in chunkidx order, even though we didn't explicitly ask
+ * Note that because the index is actually on (chunk_id, chunk_seq) we will
+ * see the chunks in chunk_seq order, even though we didn't explicitly ask
* for it.
*/
- nextidx = 0;
+ next_seq = 0;
init_toast_snapshot(&SnapshotToast);
toastscan = systable_beginscan_ordered(toastrel, toastidxs[validIndex],
@@ -1946,20 +1946,20 @@ toast_fetch_datum(struct varlena *attr)
/*
* Have a chunk, extract the sequence number and the data
*/
- residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+ res_seq = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
Assert(!isnull);
- chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+ chunk_data = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
Assert(!isnull);
- if (!VARATT_IS_EXTENDED(chunk))
+ if (!VARATT_IS_EXTENDED(chunk_data))
{
- chunksize = VARSIZE(chunk) - VARHDRSZ;
- chunkdata = VARDATA(chunk);
+ chunksize = VARSIZE(chunk_data) - VARHDRSZ;
+ chunkdata = VARDATA(chunk_data);
}
- else if (VARATT_IS_SHORT(chunk))
+ else if (VARATT_IS_SHORT(chunk_data))
{
/* could happen due to heap_form_tuple doing its thing */
- chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
- chunkdata = VARDATA_SHORT(chunk);
+ chunksize = VARSIZE_SHORT(chunk_data) - VARHDRSZ_SHORT;
+ chunkdata = VARDATA_SHORT(chunk_data);
}
else
{
@@ -1974,33 +1974,34 @@ toast_fetch_datum(struct varlena *attr)
/*
* Some checks on the data we've found
*/
- if (residx != nextidx)
- elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
- residx, nextidx,
+ if (res_seq != next_seq)
+ elog(ERROR, "unexpected chunk_seq %d (expected %d) for toast value %u in %s",
+ res_seq, next_seq,
toast_pointer.va_valueid,
RelationGetRelationName(toastrel));
- if (residx < numchunks - 1)
+
+ if (res_seq < numchunks - 1)
{
if (chunksize != TOAST_MAX_CHUNK_SIZE)
- elog(ERROR, "unexpected chunk size %d (expected %d) in chunk %d of %d for toast value %u in %s",
+ elog(ERROR, "unexpected chunk size %d (expected %d) in chunk_seq %d of %d for toast value %u in %s",
chunksize, (int) TOAST_MAX_CHUNK_SIZE,
- residx, numchunks,
+ res_seq, numchunks,
toast_pointer.va_valueid,
RelationGetRelationName(toastrel));
}
- else if (residx == numchunks - 1)
+ else if (res_seq == numchunks - 1)
{
- if ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != ressize)
- elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk %d for toast value %u in %s",
+ if ((res_seq * TOAST_MAX_CHUNK_SIZE + chunksize) != ressize)
+ elog(ERROR, "unexpected chunk size %d (expected %d) in final chunk_seq %d for toast value %u in %s",
chunksize,
- (int) (ressize - residx * TOAST_MAX_CHUNK_SIZE),
- residx,
+ (int) (ressize - res_seq * TOAST_MAX_CHUNK_SIZE),
+ res_seq,
toast_pointer.va_valueid,
RelationGetRelationName(toastrel));
}
else
- elog(ERROR, "unexpected chunk number %d (out of range %d..%d) for toast value %u in %s",
- residx,
+ elog(ERROR, "unexpected chunk_seq number %d (out of range %d..%d) for toast value %u in %s",
+ res_seq,
0, numchunks - 1,
toast_pointer.va_valueid,
RelationGetRelationName(toastrel));
@@ -2008,19 +2009,19 @@ toast_fetch_datum(struct varlena *attr)
/*
* Copy the data into proper place in our result
*/
- memcpy(VARDATA(result) + residx * TOAST_MAX_CHUNK_SIZE,
+ memcpy(VARDATA(result) + res_seq * TOAST_MAX_CHUNK_SIZE,
chunkdata,
chunksize);
- nextidx++;
+ next_seq++;
}
/*
* Final checks that we successfully fetched the datum
*/
- if (nextidx != numchunks)
- elog(ERROR, "missing chunk number %d for toast value %u in %s",
- nextidx,
+ if (next_seq != numchunks)
+ elog(ERROR, "missing chunk_seq number %d for toast value %u in %s",
+ next_seq,
toast_pointer.va_valueid,
RelationGetRelationName(toastrel));
--
2.17.0
On Thu, Apr 12, 2018 at 1:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So while looking at this, it suddenly occurred to me that probing with
SnapshotDirty isn't that safe for regular (non-TOAST) Oid assignment
either.
Yeah, it occurred to me as well, but when I looked at the code, I couldn't
find a case that was broken. I even tried a few test cases with DDLs etc.
But I think what you did is fine and more bulletproof. So +1 to that.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 12, 2018 at 5:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Paquier <michael@paquier.xyz> writes:
I have not really checked this thread in details, but one thing that
strikes me is that it would be rather easy to add a TAP test based on
the initial script that Pavan has sent. Would that be worth testing
cycles or not?
I doubt it --- that test is so specialized that it'd be unlikely to
catch any other bug.
I agree; it's not worth adding a TAP test, except maybe as a demonstration
of how to write future tests for catching such issues. This was a very
specialised test case, written after getting a full grasp on the bug, and it
just tests the thing that I knew was broken based on code reading. Also, with
the OID duplicate issue fixed, hitting more bugs in this area is going to be
even more difficult.
Thanks,
Pavan
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
On Thu, Apr 12, 2018 at 1:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So while looking at this, it suddenly occurred to me that probing with
SnapshotDirty isn't that safe for regular (non-TOAST) Oid assignment
either.
Yeah it occurred to me as well, but when I looked at the code, I couldn't
find a case that is broken. I even tried a few test cases with DDLs etc.
I think it probably can't happen for catalog MVCC scans, because we
generally take full new snapshots for those. The situation that would be
hazardous is where we use an existing snapshot (so that it doesn't see
the just-committed Other Transaction) but advance its command counter
(so that it can see the new object we just made). So the sort of failure
I'd predict is that a user query just after an object creation could see
duplicate OIDs in the catalogs. To get to that point, you might need a
stable function (using the troublesome snapshot) calling a volatile one
(which contains the actual DDL).
But I think what you did is fine and more bulletproof. So +1 to that.
Yeah, even if this isn't actually a reachable case, it's hard to argue
that we should complicate the code and take API-break risks in the back
branches to make a tiny optimization on the assumption that it can never
happen.
regards, tom lane
On Thu, Apr 12, 2018 at 5:53 AM, Peter Eisentraut <
peter.eisentraut@2ndquadrant.com> wrote:
On 4/10/18 06:29, Pavan Deolasee wrote:
One of our 2ndQuadrant support customers recently reported a sudden rush
of TOAST errors post a crash recovery, nearly causing an outage. Most
errors read like this:
ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn
While researching this, I found that the terminology in this code is
quite inconsistent. It talks about chunk ids, chunk indexes, chunk
numbers, etc., seemingly interchangeably. The above error is actually
about the chunk_seq, not about the chunk_id, as one might think.
The attached patch is my attempt to clean this up a bit. Thoughts?
While I agree that we should clean it up, I wonder if changing the error text
is a good idea. These errors have been reported for a very long time, and if
we change the text, we might lose the accumulated knowledge about past
reports.
Also, "toast value" is the same as "chunk_id". Should we clean up something
there too? "chunk_seq number" -- should that be just "chunk_seq"?
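For context on the terminology under discussion: every TOAST table shares the same fixed three-column layout. The sketch below is illustrative C, with a stand-in typedef for PostgreSQL's Oid type; it is not the actual catalog definition, which the system creates itself:

```c
#include <stdint.h>

typedef uint32_t Oid;              /* stand-in for PostgreSQL's Oid type */

/* Illustrative layout of a TOAST table row; the real table is created by
 * the system, not declared in user code. */
struct toast_chunk
{
    Oid     chunk_id;              /* the "toast value" in the error messages */
    int32_t chunk_seq;             /* position of this chunk within the value */
    /* chunk_data follows: a bytea payload of up to TOAST_MAX_CHUNK_SIZE bytes */
};

/* A unique index on (chunk_id, chunk_seq) is meant to guarantee at most one
 * chunk per slot -- the guarantee that the duplicate-OID bug defeated. */
```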
Thanks,
Pavan
On Wed, Apr 18, 2018 at 12:07:52PM +0530, Pavan Deolasee wrote:
On Thu, Apr 12, 2018 at 5:53 AM, Peter Eisentraut <
peter.eisentraut@2ndquadrant.com> wrote:
On 4/10/18 06:29, Pavan Deolasee wrote:
One of our 2ndQuadrant support customers recently reported a sudden rush
of TOAST errors post a crash recovery, nearly causing an outage. Most
errors read like this:
ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn
While researching this, I found that the terminology in this code is
quite inconsistent. It talks about chunk ids, chunk indexes, chunk
numbers, etc., seemingly interchangeably. The above error is actually
about the chunk_seq, not about the chunk_id, as one might think.
The attached patch is my attempt to clean this up a bit. Thoughts?
While I agree that we should clean it up, I wonder if changing the error text
is a good idea. These errors have been reported for a very long time, and if we
change the text, we might lose the accumulated knowledge about past reports.
Also, "toast value" is the same as "chunk_id". Should we clean up something
there too? "chunk_seq number" -- should that be just "chunk_seq"?
We can put a comment next to the error message C string if we want to
keep historical knowledge.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On 4/18/18 02:37, Pavan Deolasee wrote:
While I agree that we should clean it up, I wonder if changing the error
text is a good idea. These errors have been reported for a very long time,
and if we change the text, we might lose the accumulated knowledge about
past reports.
It's presumably fixed now, so if we see the new error message, we'll
know it's a different issue.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-Apr-11, Peter Eisentraut wrote:
On 4/10/18 06:29, Pavan Deolasee wrote:
One of our 2ndQuadrant support customers recently reported a sudden rush
of TOAST errors post a crash recovery, nearly causing an outage. Most
errors read like this:
ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn
While researching this, I found that the terminology in this code is
quite inconsistent. It talks about chunk ids, chunk indexes, chunk
numbers, etc., seemingly interchangeably. The above error is actually
about the chunk_seq, not about the chunk_id, as one might think.
The attached patch is my attempt to clean this up a bit. Thoughts?
I didn't review your patch carefully, but +1 on the idea. By all means
let's not forget this for pg12.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services