Toast issues with OldestXmin going backwards
Various comments in GetOldestXmin mention the possibility of the oldest
xmin going backward, and assert that this is actually safe. It's not.
Consider:
A table has a toastable column. A row is updated in a way that changes
the toasted value. There are now two row versions pointing to different
toast values, one live and one dead.
Now suppose the toast table - but not the main table - is vacuumed; the
dead toast entries are removed, even though they are still referenced by
the dead main-table row. Autovacuum treats the main table and toast
table separately, so this can happen.
Now suppose that OldestXmin goes backwards so that the older main table
row version is no longer dead, but merely recently-dead.
At this point, VACUUM FULL (or similar rewrites) on the table will fail
with "missing chunk number 0 for ..." toast errors, because it tries to
copy the recently-dead row, but that row's toasted values have been
vacuumed away already.
(I've been working for a while with someone on IRC to track down the
source of unexplained semi-repeatable "missing chunk number 0 ..."
errors from VACUUM FULL. I don't know at this stage whether this is the
actual problem, but it matches the symptoms.)
--
Andrew (irc:RhodiumToad)
On Thu, Apr 19, 2018 at 4:07 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
Various comments in GetOldestXmin mention the possibility of the oldest
xmin going backward, and assert that this is actually safe. It's not.
Consider:
A table has a toastable column. A row is updated in a way that changes
the toasted value. There are now two row versions pointing to different
toast values, one live and one dead.
Now suppose the toast table - but not the main table - is vacuumed; the
dead toast entries are removed, even though they are still referenced by
the dead main-table row. Autovacuum treats the main table and toast
table separately, so this can happen.
Now suppose that OldestXmin goes backwards so that the older main table
row version is no longer dead, but merely recently-dead.
At this point, VACUUM FULL (or similar rewrites) on the table will fail
with "missing chunk number 0 for ..." toast errors, because it tries to
copy the recently-dead row, but that row's toasted values have been
vacuumed away already.
I haven't tried to reproduce it, but I can see the possibility of the
problem described by you. What should we do next? I could see few
possibilities: (a) Vacuum for main and toast table should always use
same OldestXmin, (b) Before removing the row from toast table, we
should ensure that row in the main table is removed, (c) Change the
logic during rewrite such that it can detect this situation and skip
copying the tuple in the main table, on a quick look, this seems
tricky, because the toast table TID might have been reused by that
time, basically I am not sure if this can be a viable solution or (d)
Ensure that GetOldestXmin doesn't move backwards or write a new API
similar to it which doesn't allow OldestXmin to move backwards and use
that for the purpose of vacuum.
Any better ideas?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
"Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
Amit> I haven't tried to reproduce it, but I can see the possibility of
Amit> the problem described by you. What should we do next? I could see
Amit> few possibilities: (a) Vacuum for main and toast table should
Amit> always use same OldestXmin,
I don't see how this could be arranged without either disabling the
ability to vacuum the toast table independently, or storing the xmin
somewhere.
Amit> (b) Before removing the row from toast table, we should ensure
Amit> that row in the main table is removed,
We can't find the main table row version(s) without scanning the whole
main table, so this amounts (again) to disabling vacuum on the toast
table only.
Amit> (c) Change the logic during rewrite such that it can detect this
Amit> situation and skip copying the tuple in the main table,
This seems more promising, but the problem is how to detect the error
safely (since currently, it's just ereport()ed from deep inside the
toast code). How about:
1) add a toast_datum_exists() call or similar that returns false if the
(first block of) the specified external toast item is missing
2) when copying a RECENTLY_DEAD tuple, check all its external column
values using this function beforehand, and if any are missing, treat the
tuple as DEAD instead.
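A rough sketch of step 2 as it might look in copy_heap_data(), assuming
the hypothetical toast_datum_exists() from step 1 (the recently_dead
flag below is also just a placeholder for "this tuple was classified
HEAPTUPLE_RECENTLY_DEAD"):

    if (recently_dead && HeapTupleHasExternal(tuple))
    {
        Datum   values[MaxTupleAttributeNumber];
        bool    nulls[MaxTupleAttributeNumber];
        int     i;

        heap_deform_tuple(tuple, oldTupDesc, values, nulls);

        for (i = 0; i < oldTupDesc->natts; i++)
        {
            Form_pg_attribute att = oldTupDesc->attrs[i];

            if (!nulls[i] && att->attlen == -1 && !att->attisdropped &&
                VARATT_IS_EXTERNAL_ONDISK(DatumGetPointer(values[i])) &&
                !toast_datum_exists(values[i]))     /* hypothetical */
            {
                /* toast data already gone: treat the tuple as DEAD */
                isdead = true;
                break;
            }
        }
    }

(Just an illustration of the idea, not working code.)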
Amit> on a quick look, this seems tricky, because the toast table TID
Amit> might have been reused by that time,
Toast pointers don't point to TIDs fortunately, they point to OIDs. The
OID could have been reused (if there's been an OID wraparound since the
toast item was created), but that should be harmless: it means that we'd
incorrectly copy the new value with the old tuple, but the old tuple is
never going to be visible to anybody ever again so nothing will see that.
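(For context: an on-disk toast pointer is a varatt_external, essentially
as declared in postgres.h - sizes and OIDs only, no TID anywhere:

    typedef struct varatt_external
    {
        int32   va_rawsize;     /* Original data size (includes header) */
        int32   va_extsize;     /* External saved size (doesn't) */
        Oid     va_valueid;     /* Unique ID of value within TOAST table */
        Oid     va_toastrelid;  /* RelID of TOAST table containing it */
    } varatt_external;

so the only way to land on the "wrong" value is OID reuse, as described
above.)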
Amit> or (d) Ensure that GetOldestXmin doesn't move backwards or write
Amit> a new API similar to it which doesn't allow OldestXmin to move
Amit> backwards and use that for the purpose of vacuum.
This would presumably require storing a reference OldestXmin somewhere
(in shared memory? but there'd have to be a global one and one per db,
and I don't think we have any shared memory structures at present that
are per-db). We'd even have to preserve an oldestXmin for databases with
no currently active connections, though we wouldn't have to preserve it
across restarts.
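For concreteness, a minimal sketch of what (d) could look like, assuming
the per-database storage problem were somehow solved (the
Get/SetLastOldestXminForDatabase calls below are hypothetical placeholders
for exactly that missing piece):

    static TransactionId
    GetOldestXminNonRetreating(Relation rel)
    {
        TransactionId xmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
        TransactionId prev;

        /* hypothetical per-database memory of the last value handed out */
        prev = GetLastOldestXminForDatabase(MyDatabaseId);

        if (TransactionIdIsValid(prev) && TransactionIdPrecedes(xmin, prev))
            xmin = prev;        /* never let OldestXmin retreat */

        SetLastOldestXminForDatabase(MyDatabaseId, xmin);   /* hypothetical */

        return xmin;
    }

The clamping itself is trivial; all of the difficulty is in where the
previous value lives and how it survives.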
Amit> Any better ideas?
It turns out this issue has come up before (5 years ago!):
/messages/by-id/20362.1359747327@sss.pgh.pa.us
--
Andrew (irc:RhodiumToad)
"Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
Amit> (c) Change the logic during rewrite such that it can detect this
Amit> situation and skip copying the tuple in the main table,
So I dug into this one and it looks to me like the best approach. Here's
a draft patch against 10-stable, if nobody can think of any showstoppers
then I'll do the rest of the versions and commit them.
--
Andrew (irc:RhodiumToad)
Attachments:
toastvac.patch (text/x-patch)
diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index e3d09db0a8..fd091f2de5 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -1455,6 +1455,50 @@ toast_get_valid_index(Oid toastoid, LOCKMODE lock)
/* ----------
+ * toast_validate_tuple -
+ *
+ * Given one HeapTuple, test whether all of its external toast rows (if any)
+ * exist and are the correct length.
+ *
+ * tup: tuple
+ * tupleDesc: tupdesc
+ */
+bool
+toast_validate_tuple(HeapTuple tup, TupleDesc tupleDesc)
+{
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numAttrs = tupleDesc->natts;
+ int i;
+ Datum toast_values[MaxTupleAttributeNumber];
+ bool toast_isnull[MaxTupleAttributeNumber];
+
+ /*
+ * Caller can check this if they like, but it doesn't hurt to do it here
+ * too.
+ */
+ if (!HeapTupleHasExternal(tup))
+ return true;
+
+ /*
+ * Break down the tuple into fields.
+ */
+ Assert(numAttrs <= MaxTupleAttributeNumber);
+ heap_deform_tuple(tup, tupleDesc, toast_values, toast_isnull);
+
+ for (i = 0; i < numAttrs; i++)
+ {
+ if (!toast_isnull[i] &&
+ att[i]->attlen == -1 &&
+ !att[i]->attisdropped &&
+ !toast_validate_datum(toast_values[i]))
+ return false;
+ }
+
+ return true;
+}
+
+
+/* ----------
* toast_save_datum -
*
* Save one single datum into the secondary relation and return
@@ -2037,6 +2081,141 @@ toast_fetch_datum(struct varlena *attr)
}
/* ----------
+ * toast_validate_datum -
+ *
+ * Test whether a Datum's external toast rows (if any) exist with the correct
+ * lengths.
+ *
+ * This should track toast_fetch_datum to the extent that it returns false if
+ * toast_fetch_datum would have errored out due to missing chunks or incorrect
+ * lengths. (Other kinds of "can't happen" errors can still be thrown.)
+ * ----------
+ */
+bool
+toast_validate_datum(Datum value)
+{
+ bool result = true;
+ Relation toastrel;
+ Relation *toastidxs;
+ ScanKeyData toastkey;
+ SysScanDesc toastscan;
+ HeapTuple ttup;
+ TupleDesc toasttupDesc;
+ struct varlena *attr = (struct varlena *) DatumGetPointer(value);
+ struct varatt_external toast_pointer;
+ int32 ressize;
+ int32 residx,
+ nextidx;
+ int32 numchunks;
+ Pointer chunk;
+ bool isnull;
+ int32 chunksize;
+ int num_indexes;
+ int validIndex;
+ SnapshotData SnapshotToast;
+
+ if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+ return true;
+
+ /* Must copy to access aligned fields */
+ VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
+ ressize = toast_pointer.va_extsize;
+ numchunks = ((ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+ /*
+ * Open the toast relation and its indexes
+ */
+ toastrel = heap_open(toast_pointer.va_toastrelid, AccessShareLock);
+ toasttupDesc = toastrel->rd_att;
+
+ /* Look for the valid index of the toast relation */
+ validIndex = toast_open_indexes(toastrel,
+ AccessShareLock,
+ &toastidxs,
+ &num_indexes);
+
+ /*
+ * Setup a scan key to fetch from the index by va_valueid
+ */
+ ScanKeyInit(&toastkey,
+ (AttrNumber) 1,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(toast_pointer.va_valueid));
+
+ /*
+ * Read the chunks by index
+ *
+ * Note that because the index is actually on (valueid, chunkidx) we will
+ * see the chunks in chunkidx order, even though we didn't explicitly ask
+ * for it.
+ */
+ nextidx = 0;
+
+ init_toast_snapshot(&SnapshotToast);
+ toastscan = systable_beginscan_ordered(toastrel, toastidxs[validIndex],
+ &SnapshotToast, 1, &toastkey);
+ while ((ttup = systable_getnext_ordered(toastscan, ForwardScanDirection)) != NULL)
+ {
+ /*
+ * Have a chunk, extract the sequence number and the data
+ */
+ residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+ Assert(!isnull);
+ chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+ Assert(!isnull);
+ if (!VARATT_IS_EXTENDED(chunk))
+ {
+ chunksize = VARSIZE(chunk) - VARHDRSZ;
+ }
+ else if (VARATT_IS_SHORT(chunk))
+ {
+ /* could happen due to heap_form_tuple doing its thing */
+ chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+ }
+ else
+ {
+ /* should never happen */
+ elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+ toast_pointer.va_valueid,
+ RelationGetRelationName(toastrel));
+ chunksize = 0; /* keep compiler quiet */
+ }
+
+ /*
+ * Some checks on the data we've found
+ */
+ if (residx != nextidx ||
+ (residx < numchunks - 1 &&
+ chunksize != TOAST_MAX_CHUNK_SIZE) ||
+ (residx == numchunks - 1 &&
+ ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != ressize)) ||
+ (residx >= numchunks))
+ {
+ result = false;
+ break;
+ }
+
+ nextidx++;
+ }
+
+ /*
+ * Final checks
+ */
+ if (result && nextidx != numchunks)
+ result = false;
+
+ /*
+ * End scan and close relations
+ */
+ systable_endscan_ordered(toastscan);
+ toast_close_indexes(toastidxs, num_indexes, AccessShareLock);
+ heap_close(toastrel, AccessShareLock);
+
+ return result;
+}
+
+/* ----------
* toast_fetch_datum_slice -
*
* Reconstruct a segment of a Datum from the chunks saved
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index dfd089f63c..b241b67a5e 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -945,6 +945,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
HeapTuple tuple;
Buffer buf;
bool isdead;
+ bool validate_toast = false;
CHECK_FOR_INTERRUPTS();
@@ -979,6 +980,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
break;
case HEAPTUPLE_RECENTLY_DEAD:
tups_recently_dead += 1;
+ validate_toast = true;
/* fall through */
case HEAPTUPLE_LIVE:
/* Live or recently dead, must copy it */
@@ -1022,6 +1024,20 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ /*
+ * OldestXmin going backwards can mean we have a RECENTLY_DEAD row that
+ * contains toast pointers whose toast rows have already been vacuumed
+ * away (or in the worst but unlikely case, recycled). If so, then the
+ * row must really be dead to all snapshots that could access it, so
+ * treat it as DEAD instead.
+ */
+ if (validate_toast &&
+ !toast_validate_tuple(tuple, oldTupDesc))
+ {
+ isdead = true;
+ tups_recently_dead -= 1;
+ }
+
if (isdead)
{
tups_vacuumed += 1;
diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h
index fd9f83ac44..83b59257ef 100644
--- a/src/include/access/tuptoaster.h
+++ b/src/include/access/tuptoaster.h
@@ -236,4 +236,22 @@ extern Size toast_datum_size(Datum value);
*/
extern Oid toast_get_valid_index(Oid toastoid, LOCKMODE lock);
+/* ----------
+ * toast_validate_tuple -
+ *
+ * Given one HeapTuple, test whether all of its external toast rows (if any)
+ * exist and are the correct length.
+ * ----------
+ */
+extern bool toast_validate_tuple(HeapTuple tup, TupleDesc tupleDesc);
+
+/* ----------
+ * toast_validate_datum -
+ *
+ * Test whether a Datum's external toast rows (if any) exist with the correct
+ * lengths.
+ * ----------
+ */
+extern bool toast_validate_datum(Datum value);
+
#endif /* TUPTOASTER_H */
On Sat, 21 Apr 2018 at 6:46 PM, Andrew Gierth <andrew@tao11.riddles.org.uk>
wrote:
So I dug into this one and it looks to me like the best approach. Here's
a draft patch against 10-stable, if nobody can think of any showstoppers
then I'll do the rest of the versions and commit them.
How does this guard against the case when the same OID gets inserted in the
toast table again, with matching lengths etc? Rare but seems possible, no?
I think we should look for a more complete solution, given how hard these
bugs are to detect and fix.
Thanks,
Pavan
--
Andrew (irc:RhodiumToad)
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2018-04-21 14:16:27 +0100, Andrew Gierth wrote:
"Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
Amit> (c) Change the logic during rewrite such that it can detect this
Amit> situation and skip copying the tuple in the main table,
So I dug into this one and it looks to me like the best approach. Here's
a draft patch against 10-stable, if nobody can think of any showstoppers
then I'll do the rest of the versions and commit them.
Please wait for a bit. This isn't a trivial change, and I would like to
wrap my head around it.
At the very least this seems like it could cause extreme slowdowns for
large tables that have the right update/delete patterns?
+ /*
+ * OldestXmin going backwards can mean we have a RECENTLY_DEAD row that
+ * contains toast pointers whose toast rows have already been vacuumed
+ * away (or in the worst but unlikely case, recycled). If so, then the
+ * row must really be dead to all snapshots that could access it, so
+ * treat it as DEAD instead.
+ */
How is it guaranteed that the recheck doesn't find a different toast
tuple at the relevant position?
Greetings,
Andres Freund
"Andres" == Andres Freund <andres@anarazel.de> writes:
Andres> Please wait for a bit. This isn't a trivial change, and I would
Andres> like to wrap my head around it.
Sure.
Andres> At the very least this seems like it could cause extreme
Andres> slowdowns for large tables that have the right update/delete
Andres> patterns?
I'm working on the basis that fetching the same set of toast rows twice
in succession will at least mean they are likely in cache for the second
time, so we shouldn't be doing any more actual I/O than before. (This is
assuming that a system with really big toast values is going to also
have plenty of RAM, which is a generally safe assumption since such
values are copied many times in memory by pg_dump.) There is still some
overhead in the check, of course; I will see about doing some
benchmarks.
+ /*
+ * OldestXmin going backwards can mean we have a RECENTLY_DEAD row that
+ * contains toast pointers whose toast rows have already been vacuumed
+ * away (or in the worst but unlikely case, recycled). If so, then the
+ * row must really be dead to all snapshots that could access it, so
+ * treat it as DEAD instead.
+ */
Andres> How is it guaranteed that the recheck doesn't find a different
Andres> toast tuple at the relevant position?
It's not, but it doesn't matter as long as the new toast rows can pass
through toast_fetch_datum without throwing an error (which is why
toast_validate_datum has to scan all the entries and check all their
lengths).
Firstly, we can assume the following statements (or at least if we
can't, we have far worse problems than this):
1. OldestXmin going backwards cannot make any tuple visible in any
(non-vacuum) snapshot that it wasn't visible in before;
2. If a tuple is non-vacuumable itself, it won't have its toast
entries vacuumed away;
3. Once a tuple is vacuumable at any single instant, it will never
become visible in any snapshot.
So the only case where a recently-dead tuple can have pointers to a
now-recycled toast entry OID is if it was at some point vacuumable, such
that a toast-table-only vacuum removed its toast entries, and a new insert
or update subsequently inserted a value of the same length that was
assigned the same OID.
Since the tuple was vacuumable at that point, it will never again become
visible in any snapshot, so the only code that should ever access its
toast values is vacuum full/cluster, which only ever copies the value
and does not attempt to decompress or otherwise interpret it. So the
only possible consequences for copying the wrong value, as long as we
don't throw an error in the process, are that (a) some space is possibly
wasted in the toast table - which actually happens already in far more
common circumstances - and (b) someone doing forensics or other
non-standard access to dead tuples might (in this vanishingly rare case)
see something unexpected. At no time would incorrect data be visible in
any query.
(regarding wasted space in the toast table: vacuum full / cluster will
insert live toast entries for copied recently-dead tuples, and if those
toast entries are not also referenced from any live tuple, they will
never be removed by subsequent normal vacuums, since nothing can ever
mark those entries dead; only a later vacuum full / cluster can remove
them.)
--
Andrew (irc:RhodiumToad)
"Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
Andrew> There is still some overhead in the check, of course; I will
Andrew> see about doing some benchmarks.
Preliminary results suggest that the user-CPU cost of vacuum full
increases by ~16% when the entire table is recently-dead rows (with a
toasted column of ~10k length) compared to the same table when all rows
are live.
Since I doubt the practical value of vacuum full on a table which is
100% recently-dead, and since I would expect the live:recently-dead ratio
to not normally be much worse than 1:1 (making the overhead 8%) and more
likely up around 10:1 (making the overhead 1.5%), I think this is not an
issue. (When you have a lot of recently-dead rows is exactly the _wrong_
time to be doing a vacuum full, since it wouldn't save you any space and
you'd bloat the toast table as mentioned previously.)
--
Andrew (irc:RhodiumToad)
On Sat, Apr 21, 2018 at 1:26 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
"Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
Amit> I haven't tried to reproduce it, but I can see the possibility of
Amit> the problem described by you. What should we do next? I could see
Amit> few possibilities: (a) Vacuum for main and toast table should
Amit> always use same OldestXmin,
I don't see how this could be arranged without either disabling the
ability to vacuum the toast table independently, or storing the xmin
somewhere.
I think we can use the same xmin for both main heap and toast by
computing it before scanning the main heap (lazy_vacuum_rel) and then
pass it down. I have tried it and came up with the attached patch.
Amit> (b) Before removing the row from toast table, we should ensure
Amit> that row in the main table is removed,
We can't find the main table row version(s) without scanning the whole
main table, so this amounts (again) to disabling vacuum on the toast
table only.
Amit> (c) Change the logic during rewrite such that it can detect this
Amit> situation and skip copying the tuple in the main table,
This seems more promising, but the problem is how to detect the error
safely (since currently, it's just ereport()ed from deep inside the
toast code). How about:
1) add a toast_datum_exists() call or similar that returns false if the
(first block of) the specified external toast item is missing
2) when copying a RECENTLY_DEAD tuple, check all its external column
values using this function beforehand, and if any are missing, treat the
tuple as DEAD instead.
Amit> on a quick look, this seems tricky, because the toast table TID
Amit> might have been reused by that time,
Toast pointers don't point to TIDs fortunately, they point to OIDs. The
OID could have been reused (if there's been an OID wraparound since the
toast item was created), but that should be harmless: it means that we'd
incorrectly copy the new value with the old tuple, but the old tuple is
never going to be visible to anybody ever again so nothing will see that.
Yeah, that's right, but it gives some uneasy feeling that we are
attaching the wrong toast value. If you think there is some basic
flaw in Approach-1 (as in attached patch) or it requires some major
surgery then we can look further into this.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
use_same_oldestxmin_mainheap_toast_v1.patch (application/octet-stream)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index d088dc1..8b8b97f 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -765,7 +765,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
HeapScanDesc heapScan;
bool use_wal;
bool is_system_catalog;
- TransactionId OldestXmin;
+ TransactionId OldestXmin = InvalidTransactionId;
TransactionId FreezeXid;
MultiXactId MultiXactCutoff;
RewriteState rwstate;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d90cb9a..d745127 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -75,7 +75,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
TransactionId lastSaneFrozenXid,
MultiXactId lastSaneMinMulti);
static bool vacuum_rel(Oid relid, RangeVar *relation, int options,
- VacuumParams *params);
+ VacuumParams *params, TransactionId oldestXmin);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -337,7 +337,8 @@ vacuum(int options, List *relations, VacuumParams *params,
if (options & VACOPT_VACUUM)
{
- if (!vacuum_rel(vrel->oid, vrel->relation, options, params))
+ if (!vacuum_rel(vrel->oid, vrel->relation, options, params,
+ InvalidTransactionId))
continue;
}
@@ -618,8 +619,9 @@ vacuum_set_xid_limits(Relation rel,
* working on a particular table at any time, and that each vacuum is
* always an independent transaction.
*/
- *oldestXmin =
- TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+ if (!TransactionIdIsValid(*oldestXmin))
+ *oldestXmin =
+ TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
Assert(TransactionIdIsNormal(*oldestXmin));
@@ -1298,7 +1300,8 @@ vac_truncate_clog(TransactionId frozenXID,
* At entry and exit, we are not inside a transaction.
*/
static bool
-vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
+vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params,
+ TransactionId oldestXmin)
{
LOCKMODE lmode;
Relation onerel;
@@ -1530,6 +1533,19 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
toast_relid = InvalidOid;
/*
+ * Use same OldestXmin for heap and toast table, otherwise, the heap can
+ * have RECENTLY_DEAD row that contains toast pointers whose toast rows
+ * have already been vacuumed away. This is because OldestXmin can move
+ * backward on repeated calls. See GetOldestXmin.
+ */
+ if (OidIsValid(toast_relid))
+ {
+ if (!TransactionIdIsValid(oldestXmin))
+ oldestXmin =
+ TransactionIdLimitedForOldSnapshots(GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM), onerel);
+ }
+
+ /*
* Switch to the table owner's userid, so that any index functions are run
* as that user. Also lock down security-restricted operations and
* arrange to make GUC variable changes local to this command. (This is
@@ -1554,7 +1570,7 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
(options & VACOPT_VERBOSE) != 0);
}
else
- lazy_vacuum_rel(onerel, options, params, vac_strategy);
+ lazy_vacuum_rel(onerel, options, params, vac_strategy, oldestXmin);
/* Roll back any GUC changes executed by index functions */
AtEOXact_GUC(false, save_nestlevel);
@@ -1580,7 +1596,7 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
* totally unimportant for toast relations.
*/
if (toast_relid != InvalidOid)
- vacuum_rel(toast_relid, NULL, options, params);
+ vacuum_rel(toast_relid, NULL, options, params, oldestXmin);
/*
* Now release the session-level lock on the master table.
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5649a70..9331089 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -188,7 +188,7 @@ static bool heap_page_is_all_visible(Relation rel, Buffer buf,
*/
void
lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
- BufferAccessStrategy bstrategy)
+ BufferAccessStrategy bstrategy, TransactionId oldestXmin)
{
LVRelStats *vacrelstats;
Relation *Irel;
@@ -233,8 +233,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit, &xidFullScanLimit,
+ &oldestXmin, &FreezeLimit, &xidFullScanLimit,
&MultiXactCutoff, &mxactFullScanLimit);
+ OldestXmin = oldestXmin;
/*
* We request an aggressive scan if the table's frozen Xid is now older
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 85d472f..38a3c60 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -188,7 +188,8 @@ extern void vacuum_delay_point(void);
/* in commands/vacuumlazy.c */
extern void lazy_vacuum_rel(Relation onerel, int options,
- VacuumParams *params, BufferAccessStrategy bstrategy);
+ VacuumParams *params, BufferAccessStrategy bstrategy,
+ TransactionId oldestXmin);
/* in commands/analyze.c */
extern void analyze_rel(Oid relid, RangeVar *relation, int options,
"Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
Amit> I haven't tried to reproduce it, but I can see the possibility of
Amit> the problem described by you. What should we do next? I could see
Amit> few possibilities: (a) Vacuum for main and toast table should
Amit> always use same OldestXmin,
I don't see how this could be arranged without either disabling the
ability to vacuum the toast table independently, or storing the xmin
somewhere.
Amit> I think we can use the same xmin for both main heap and toast
But you're completely missing the point here, which is that the main
heap might not be vacuumed AT ALL. Autovacuum can and will call vacuum()
with the relid of a toast table - this is a deliberate decision to cope
with access patterns that would otherwise bloat the toast table.
(It's also currently legal to do VACUUM pg_toast.pg_toast_nnnn;
manually, but that's of secondary importance compared to the fact that
autovacuum does it.)
Amit> by computing it before scanning the main heap (lazy_vacuum_rel)
Amit> and then pass it down. I have tried it and came up with the
Amit> attached patch.
Your patch would actually be needed if (and only if) autovacuum was
changed back to its old behavior of never vacuuming toast tables
independently, and if manual VACUUM pg_toast.*; was disabled. But in the
presence of either of those two possibilities, it does nothing useful.
Toast pointers don't point to TIDs fortunately, they point to OIDs.
The OID could have been reused (if there's been an OID wraparound
since the toast item was created), but that should be harmless: it
means that we'd incorrectly copy the new value with the old tuple,
but the old tuple is never going to be visible to anybody ever again
so nothing will see that.
Amit> Yeah, that's right, but it gives some uneasy feeling that we are
Amit> attaching the wrong toast value. If you think there is some basic
Amit> flaw in Approach-1 (as in attached patch) or it requires some
Amit> major surgery then we can look further into this.
The basic flaw is that it doesn't even reach the problem.
--
Andrew (irc:RhodiumToad)
"Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
Amit> Yeah, that's right, but it gives some uneasy feeling that we are
Amit> attaching the wrong toast value.
Oh, forgot to stress this before: referencing the wrong toast value in
this case is something that can already happen, my patch does not affect
either the probability of it occurring or the consequences of it.
--
Andrew (irc:RhodiumToad)
"Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
Andrew> Since the tuple was vacuumable at that point, it will never
Andrew> again become visible in any snapshot, so the only code that
Andrew> should ever access its toast values is vacuum full/cluster,
and, unfortunately, CREATE INDEX (non-concurrently), including the index
build done by vacfull/cluster itself, which rather messes things up
because the toasted column might be a parameter to a functional index, or it
might itself be an indexed column (within a narrow size range for btree,
or more generally for other index types); this means the index AM or the
indexed function may try and actually detoast and inspect the value.
So right now it's possible (I think) in the worst case to crash the
server if you can set up exactly the right circumstances (re-use of a
prematurely vacuumed toast oid after wraparound, with the new value
matching the old value's raw length but not its compression state).
(My changes so far don't address this case and don't make it either more
or less likely)
--
Andrew (irc:RhodiumToad)
On Mon, Apr 23, 2018 at 1:55 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
"Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
Amit> by computing it before scanning the main heap (lazy_vacuum_rel)
Amit> and then pass it down. I have tried it and came up with the
Amit> attached patch.
Your patch would actually be needed if (and only if) autovacuum was
changed back to its old behavior of never vacuuming toast tables
independently, and if manual VACUUM pg_toast.*; was disabled. But in the
presence of either of those two possibilities, it does nothing useful.
Yeah, right, I have missed the point that they can be vacuumed
separately, however, I think that decision is somewhat questionable.
Basically, it doesn't appear logical to leave the sort-of dangling
toast pointer in heap. I think this is somewhat akin to our current
index and heap binding, where we never let a heap itemid go away till
all index entries pointing to it are taken care of. I have gone
through the commit (3ccde312ec8ee47f5f797b070d34a675799448ae) which
appears to have decoupled this mechanism, but it doesn't provide any
reason for doing so. I have also gone through the related thread [1]
which is not much help either. I think there must be some patterns
where toast table bloat causes much more harm than heap as toast rows
are much larger in size. Are you aware of what were the concrete
reasons behind this change? I think it would have been better if
along with decoupling of vacuum for main heap and toast tables, we
would have come up with a way to selectively remove the corresponding
rows from the main heap, say by just vacuuming heap pages/rows which
have toast pointers but maybe that is not viable or involves much more
work without equivalent benefit.
Also, we can think along the lines of another idea suggested by Andres
[2] on the thread mentioned by you.
[1]: /messages/by-id/20080808165809.GB3800@alvh.no-ip.org
[2]: /messages/by-id/20130204164341.GB22226@awork2.anarazel.de
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
"Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
Your patch would actually be needed if (and only if) autovacuum was
changed back to its old behavior of never vacuuming toast tables
independently, and if manual VACUUM pg_toast.*; was disabled. But in
the presence of either of those two possibilities, it does nothing
useful.
Amit> Yeah, right, I have missed the point that they can be vacuumed
Amit> separately, however, I think that decision is somewhat
Amit> questionable.
Some previous discussion links for reference, for the background to the
thread containing the patch:
/messages/by-id/87y7gpiqx3.fsf@oxford.xeocode.com
/messages/by-id/20080608230348.GD11028@alvh.no-ip.org
Amit> I think it would have been better if along with decoupling of
Amit> vacuum for main heap and toast tables, we would have come up with
Amit> a way to selectively remove the corresponding rows from the main
Amit> heap, say by just vacuuming heap pages/rows which have toast
Amit> pointers but maybe that is not viable or involves much more work
Amit> without equivalent benefit.
It should be fairly obvious why this is unworkable - most toast-using
tables will have toast pointers on every page, but without making a
whole new index of toast pointer OIDs (unacceptable overhead), it would
be impossible to find the toast pointers pointing to a specific item
without searching the whole rel (in which case we might just as well
have vacuumed it).
Amit> Also, we can think along the lines of another idea suggested by
Amit> Andres [2] on the thread mentioned by you.
That one is tricky for various reasons (locking, order of operations in
vacuum_rel, having to mess with the API of vacuum(), etc.)
--
Andrew (irc:RhodiumToad)
On Mon, Apr 23, 2018 at 1:34 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
"Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
Your patch would actually be needed if (and only if) autovacuum was
changed back to its old behavior of never vacuuming toast tables
independently, and if manual VACUUM pg_toast.*; was disabled. But in
the presence of either of those two possibilities, it does nothing
useful.
Amit> Yeah, right, I have missed the point that they can be vacuumed
Amit> separately, however, I think that decision is somewhat
Amit> questionable.
Some previous discussion links for reference, for the background to the
thread containing the patch:
/messages/by-id/87y7gpiqx3.fsf@oxford.xeocode.com
/messages/by-id/20080608230348.GD11028@alvh.no-ip.org
If I read correctly, it seems one of the main reasons [1] is to save
the extra pass over the heap and improve the code. Now, ideally, the
extra pass over heap should also free up some space (occupied by the
rows that contains old toast pointers corresponding to which we are
going to remove rows from toast table), but it is quite possible that
it is already clean due to a previous separate vacuum pass over the
heap. I think it is good to save extra pass over heap which might not
be as useful as we expect, but that can cost us correctness issues in
boundary cases as in the case being discussed in this thread.
Amit> I think it would have been better if along with decoupling of
Amit> vacuum for main heap and toast tables, we would have come up with
Amit> a way to selectively remove the corresponding rows from the main
Amit> heap, say by just vacuuming heap pages/rows which have toast
Amit> pointers but maybe that is not viable or involves much more work
Amit> without equivalent benefit.It should be fairly obvious why this is unworkable - most toast-using
tables will have toast pointers on every page, but without making a
whole new index of toast pointer OIDs (unacceptable overhead), it would
be impossible to find the toast pointers pointing to a specific item
without searching the whole rel (in which case we might just as well
have vacuumed it).
Okay, such an optimization might not be very effective and it would
anyway lead to an extra pass over the heap; however, that would resolve
the correctness issue. Now, I understand that it is not advisable to
go back to the previous behavior for performance concerns, but I think
it would be better if we find a bullet-proof way to fix this symptom,
rather than just fixing the issue reported in this thread.
[1]: From one of the earlier emails: "We go certain lengths in autovacuum to
make sure tables are vacuumed when their toast table needs vacuuming
and the main table does not, which is all quite kludgy. So we already
look at their stats and make decisions about them. But what we do
after that is force a vacuum to the main table, even if that one does
not need any vacuuming, which is dumb."
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 4/23/18 00:33, Amit Kapila wrote:
Yeah, right, I have missed the point that they can be vacuumed
separately, however, I think that decision is somewhat questionable.
Manually vacuuming the TOAST table was a way to repair the recently
fixed TOAST bug, so it's kind of useful.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 23, 2018 at 11:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
If I read correctly, it seems one of the main reason [1] is to save
the extra pass over the heap and improve the code. Now, ideally, the
extra pass over heap should also free up some space (occupied by the
rows that contains old toast pointers corresponding to which we are
going to remove rows from toast table), but it is quite possible that
it is already clean due to a previous separate vacuum pass over the
heap. I think it is good to save extra pass over heap which might not
be as useful as we expect, but that can cost us correctness issues in
boundary cases as in the case being discussed in this thread.
I don't think anybody disagrees with that, but it doesn't answer the
question of how best to fix it. Making VACUUM a lot more expensive
will not win us any awards, and an extra pass over the heap could be
very expensive. You seem to think Andrew's fix isn't really
addressing the root of the problem, but I think that just depends on
how you define the problem. If you define the problem as "the table
should never have dangling pointers to the TOAST table", then Andrew's
approach is only fixing the symptom. But if you define the problem as
"we shouldn't try to follow TOAST pointers in heap tuples that are not
visible to any current or future MVCC snapshot", then it's fixing the
root issue.
I wonder if perhaps get_actual_variable_range() has a hazard of this
type as well. If OldestXmin retreats, then the special snapshot type
which it uses could see a row whose TOAST entry has been removed
meanwhile.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
I wonder if perhaps get_actual_variable_range() has a hazard of this
type as well. If OldestXmin retreats, then the special snapshot type
which it uses could see a row whose TOAST entry has been removed
meanwhile.
Hm, yeah. I wonder if we could fix that by extracting the value from
the index rather than recomputing it from the heap tuple, as we do now.
We'd still have to visit the heap tuple to check liveness, but we'd
not need to extract anything from its data part.
Have we given up on the angle of "prevent OldestXmin from retreating"?
That seems like it'd be far more bulletproof than trying to work
around places that break one-by-one.
regards, tom lane
On Thu, Apr 26, 2018 at 03:09:01PM -0400, Tom Lane wrote:
Have we given up on the angle of "prevent OldestXmin from retreating"?
That seems like it'd be far more bulletproof than trying to work
around places that break one-by-one.
Strong +1 on that.
--
Michael
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> Have we given up on the angle of "prevent OldestXmin from
Tom> retreating"?
I haven't, and I'm thinking about it, but I don't have an immediate
solution.
Thinking-out-loud mode:
1) we could store minimum values for OldestXmin for each database in
some kind of shared-memory structure. Downside is that we can't
easily make this fixed-size, since we don't know how many DBs there
are - we can't size it based on connections since we need to be able
to track values for dbs with no currently active connections.
(Or do we need to track it across restarts? maybe we do, to deal with
replication slaves without slots, or changes in parameters)
2) in-place updates of a new column in pg_database? and a control file
field for the global values? not back-patchable, but would it work
going forward?
3) there was the previous suggestion to store it per-table in the main
table's reloptions and inplace-update that; this isn't impossible,
but it would be fiddly to do because toast-table-only vacuums would
need to locate the main table, and lock order issues might get
interesting. (Maybe it would be sneakier to store it in the toast
table's otherwise-unused reloptions, and have vacuum/create index on
the main table consult that? This assumes that we only care about the
issue when dealing with toast tables)
4) something else??
--
Andrew (irc:RhodiumToad)
"Robert" == Robert Haas <robertmhaas@gmail.com> writes:
Robert> I wonder if perhaps get_actual_variable_range() has a hazard of
Robert> this type as well. If OldestXmin retreats, then the special
Robert> snapshot type which it uses could see a row whose TOAST entry
Robert> has been removed meanwhile.
Actually get_actual_variable_range may have a hazard even if OldestXmin
does not retreat; it uses RecentGlobalXmin, which is quite likely to be
older than an OldestXmin value used by a vacuum.
--
Andrew (irc:RhodiumToad)
On Fri, Apr 27, 2018 at 12:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Apr 23, 2018 at 11:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
If I read correctly, it seems one of the main reason [1] is to save
the extra pass over the heap and improve the code. Now, ideally, the
extra pass over heap should also free up some space (occupied by the
rows that contains old toast pointers corresponding to which we are
going to remove rows from toast table), but it is quite possible that
it is already clean due to a previous separate vacuum pass over the
heap. I think it is good to save extra pass over heap which might not
be as useful as we expect, but that can cost us correctness issues in
boundary cases as in the case being discussed in this thread.
I don't think anybody disagrees with that, but it doesn't answer the
question of how best to fix it. Making VACUUM a lot more expensive
will not win us any awards, and an extra pass over the heap could be
very expensive. You seem to think Andrew's fix isn't really
addressing the root of the problem, but I think that just depends on
how you define the problem. If you define the problem as "the table
should never have dangling pointers to the TOAST table", then Andrew's
approach is only fixing the symptom. But if you define the problem as
"we shouldn't try to follow TOAST pointers in heap tuples that are not
visible to any current or future MVCC snapshot", then it's fixing the
root issue.
Hmm, in some rare cases when an OID gets reused, it can still lead to
problems (when CREATE INDEX is performed in parallel with CLUSTER) as
mentioned by Andrew. I am not sure if that is acceptable on the
grounds that it is not a newly introduced problem by his proposed
patch. I think if we can come up with a solution which prevents
OldestXmin from retreating, it will be good, otherwise, we might have
to think harder whether there is any other solution to this problem
which will attack the root cause of the problem and if nothing better
comes out, then probably we can go with his proposed patch.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 26, 2018 at 9:03 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
(Or do we need to track it across restarts? maybe we do, to deal with
replication slaves without slots, or changes in parameters)
Yeah, I'm worried that it might need to be persistent across restarts.
One idea that occurred to me is to somehow record -- I guess in
pg_class using non-transactional updates -- the last cutoff XID used
to vacuum any given table. Then we could just make a rule that you
can't vacuum the TOAST table with an XID that's newer than the last
one used for the main table. That would preserve the property that
you can vacuum the tables separately while avoiding dangling pointers.
But that's obviously not back-patchable, and it sounds finicky to get
right. It's really too bad that heap tables don't include a metapage
where we could store details like this...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
"Robert" == Robert Haas <robertmhaas@gmail.com> writes:
Robert> One idea that occurred to me is to somehow record -- I guess in
Robert> pg_class using non-transactional updates -- the last cutoff XID
Robert> used to vacuum any given table. Then we could just make a rule
Robert> that you can't vacuum the TOAST table with an XID that's newer
Robert> than the last one used for the main table. That would preserve
Robert> the property that you can vacuum the tables separately while
Robert> avoiding dangling pointers. But that's obviously not
Robert> back-patchable,
The suggestion made previously (in a historical thread) was to use an
entry in the reloptions field for this, at least in back branches. It
would be necessary for vacuum to add the entry initially in a normal
transactional update, after which it could be updated inplace.
--
Andrew (irc:RhodiumToad)
On Fri, Apr 27, 2018 at 11:35 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
"Robert" == Robert Haas <robertmhaas@gmail.com> writes:
Robert> One idea that occurred to me is to somehow record -- I guess in
Robert> pg_class using non-transactional updates -- the last cutoff XID
Robert> used to vacuum any given table. Then we could just make a rule
Robert> that you can't vacuum the TOAST table with an XID that's newer
Robert> than the last one used for the main table. That would preserve
Robert> the property that you can vacuum the tables separately while
Robert> avoiding dangling pointers. But that's obviously not
Robert> back-patchable,
The suggestion made previously (in a historical thread) was to use an
entry in the reloptions field for this, at least in back branches. It
would be necessary for vacuum to add the entry initially in a normal
transactional update, after which it could be updated inplace.
Yeah, I suppose. Sounds pretty rickety to me, though. Maybe I'm just
a pessimist.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,
* Robert Haas (robertmhaas@gmail.com) wrote:
On Fri, Apr 27, 2018 at 11:35 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:"Robert" == Robert Haas <robertmhaas@gmail.com> writes:
Robert> One idea that occurred to me is to somehow record -- I guess in
Robert> pg_class using non-transactional updates -- the last cutoff XID
Robert> used to vacuum any given table. Then we could just make a rule
Robert> that you can't vacuum the TOAST table with an XID that's newer
Robert> than the last one used for the main table. That would preserve
Robert> the property that you can vacuum the tables separately while
Robert> avoiding dangling pointers. But that's obviously not
Robert> back-patchable,
The suggestion made previously (in a historical thread) was to use an
entry in the reloptions field for this, at least in back branches. It
would be necessary for vacuum to add the entry initially in a normal
transactional update, after which it could be updated inplace.
Yeah, I suppose. Sounds pretty rickety to me, though. Maybe I'm just
a pessimist.
I tend to agree.. However, this isn't something that's been happening a
lot, from what I gather, and if we actually add a proper column into
pg_class for future versions (not really sure how I feel about if that
means "v11" or "v12" right now...) and reloptions for back-branches then
perhaps it's not so bad.
As far as a metadata page, it'd be pretty overkill, but maybe a fork for
it..? I'm trying to think if there might be anything else we'd be able
to put into such a fork since adding another inode to every relation
that'll only ever likely be 8k definitely wouldn't win us any fans. Not
sure, just brainstorming.
Thanks!
Stephen
"Stephen" == Stephen Frost <sfrost@snowman.net> writes:
Stephen> I tend to agree.. However, this isn't something that's been
Stephen> happening a lot, from what I gather,
Right now I can't prove it happens in the wild at all - I think I've
ruled it out as an explanation for the original problem report - but
it's very easy to demonstrate in tests.
--
Andrew (irc:RhodiumToad)
On Fri, Apr 27, 2018 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Apr 26, 2018 at 9:03 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
(Or do we need to track it across restarts? maybe we do, to deal with
replication slaves without slots, or changes in parameters)
Yeah, I'm worried that it might need to be persistent across restarts.
One idea that occurred to me is to somehow record -- I guess in
pg_class using non-transactional updates -- the last cutoff XID used
to vacuum any given table. Then we could just make a rule that you
can't vacuum the TOAST table with an XID that's newer than the last
one used for the main table. That would preserve the property that
you can vacuum the tables separately while avoiding dangling pointers.
Won't this lead to a bloat in toast tables when there is a big
difference between the cutoff XID of the main heap table and the
latest values of OldestXmin? As both the tables can be vacuumed
separately, it doesn't sound impossible, but maybe some later pass
over the main heap and toast table will fix the situation. Even if the
difference is not big, vacuum might still not be able to remove rows
from the toast table in some situations.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
"Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
(Or do we need to track it across restarts? maybe we do, to deal with
replication slaves without slots, or changes in parameters)
Yeah, I'm worried that it might need to be persistent across restarts.
One idea that occurred to me is to somehow record -- I guess in
pg_class using non-transactional updates -- the last cutoff XID used
to vacuum any given table. Then we could just make a rule that you
can't vacuum the TOAST table with an XID that's newer than the last
one used for the main table. That would preserve the property that
you can vacuum the tables separately while avoiding dangling pointers.
Amit> Won't this lead to a bloat in toast tables when there is a big
Amit> difference between the cutoff XID of the main heap table and the
Amit> latest values of OldestXmin?
Yes. What we need is actually the reverse of what Robert describes -
when we vacuum the _main_ table, we must use the _later_ of the
currently calculated OldestXmin or the OldestXmin last used to vacuum
the toast table.
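A minimal sketch of that rule, assuming the toast table's last-used
OldestXmin could be looked up from somewhere (the lookup below is a
hypothetical placeholder - where that value gets stored is still the
open question):

    /* when about to vacuum the main heap */
    TransactionId heap_xmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM);
    TransactionId toast_xmin = GetLastToastVacuumXmin(toast_relid);  /* hypothetical */

    if (TransactionIdIsValid(toast_xmin) &&
        TransactionIdPrecedes(heap_xmin, toast_xmin))
        heap_xmin = toast_xmin;     /* take the later of the two */

That keeps the main-heap cutoff from lagging behind the cutoff already
applied to the toast table, so any row version whose toast rows were
removed is itself removed rather than left dangling.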
--
Andrew (irc:RhodiumToad)
On Sun, Apr 29, 2018 at 11:56 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
"Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
(Or do we need to track it across restarts? maybe we do, to deal with
replication slaves without slots, or changes in parameters)
Yeah, I'm worried that it might need to be persistent across restarts.
One idea that occurred to me is to somehow record -- I guess in
pg_class using non-transactional updates -- the last cutoff XID used
to vacuum any given table. Then we could just make a rule that you
can't vacuum the TOAST table with an XID that's newer than the last
one used for the main table. That would preserve the property that
you can vacuum the tables separately while avoiding dangling pointers.
Amit> Won't this lead to a bloat in toast tables when there is a big
Amit> difference between the cutoff XID of the main heap table and the
Amit> latest values of OldestXmin?
Yes. What we need is actually the reverse of what Robert describes -
when we vacuum the _main_ table, we must use the _later_ of the
currently calculated OldestXmin or the OldestXmin last used to vacuum
the toast table.
I think then that the same formula needs to be used during cluster as
well. Also, what about get_actual_variable_range()? It will also need a
similar change - is it okay to add an additional lookup of pg_class in
that code path?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com